[ieee 2012 19th working conference on reverse engineering (wcre) - kingston, on, canada...



Astra: Bottom-up Construction of Structured Artifact Repositories

Joel Ossher, Hitesh Sajnani, Cristina Lopes

Donald Bren School of Information and Computer Sciences
University of California, Irvine
Irvine, California, USA
{jossher,hsajani,lopes}@ics.uci.edu

Abstract: Structured artifact repositories, such as Java’s Maven Central Repository, provide developers with the ability to easily locate and manage their programs’ external dependencies. Unfortunately, these artifact repositories are often missing popular libraries, dramatically decreasing their usefulness. Our investigation of the scope of the Maven Central Repository showed that it is missing over 60% of externally referenced types in a large collection of open source Java projects. Artifact repositories, such as the Maven Central Repository, are manually curated, and constructed in a top-down manner. This makes expanding their scope, and keeping them up to date, a time consuming process.

To address this issue, we present Astra, an algorithm for the automated bottom-up construction of structured artifact repositories. Astra takes as input a collection of unknown library artifacts, such as jar files, and returns a structured artifact repository. The resulting repository contains libraries, which are divided into versions and associated with individual artifacts. This bottom-up construction is accomplished by analyzing the co-occurrence of types between the library artifacts. Astra was implemented for Java as part of the Sourcerer Infrastructure. Our evaluation demonstrates that Astra generates a repository with 77% similarity to the Maven Central Repository when provided the same base artifacts. An examination of the differences between the manually curated and generated repositories indicates that in many cases the generated structure has significant merit.

I. INTRODUCTION

The growth of the open source software movement has enhanced the opportunities for reuse in software development by generating a large number of high-quality libraries [10]. Given that library use can decrease development time and improve product quality [6], the increased availability of libraries has resulted in their widespread use.

The management of these external dependencies can be a significant challenge, and is complicated by libraries that themselves use other libraries. The result is that if a project depends on one library directly, it depends on many more indirectly. Locating all of these transitive dependencies can be difficult, as they are not always packaged with or clearly indicated by their consumer library. These transitive dependencies may also result in conflicts, as two libraries can depend on different incompatible versions of a third library. These difficulties are colloquially referred to as “DLL hell”1, which suggests the general sentiment surrounding them.

1 http://en.wikipedia.org/wiki/DLL_Hell

Artifact repositories are a popular library management solution, as they provide a framework for collecting and organizing external artifacts. Figure 1 shows an example hierarchical structure for a Java library artifact repository. At the top level, the repository contains libraries, such as junit. Each library is broken into multiple versions, such as junit 4.9 and junit 4.10. Finally, each library version is associated with some number of artifacts, in this case jar files like junit-4.9.jar and junit-4.9-dep.jar.

Artifact repositories can be constructed locally, whereby developers add and categorize all the artifacts relevant to their projects. Numerous public artifact repositories also exist, both to support specific build systems and to provide a single location from which developers can locate artifacts. These repositories exist for multiple programming languages, and languages have multiple repositories. For example, Java has the Maven Central Repository2, Python the Python Package Index3 and C# the Refix Repository4.

Despite their size, every one of these public artifact repositories is manually curated, and constructed in a top-down manner. This makes expanding their scope, and keeping them up to date, a time consuming process. While scope issues are somewhat ameliorated through user feedback, these repositories are regularly missing popular libraries. Our empirical evaluation of the scope of the Maven Central Repository shows that it is missing over 60% of externally referenced types in a large collection of open source Java projects. The full results are presented in Section II.

This lack of adequate coverage makes it significantly more

difficult for developers to resolve external dependencies. In

Java, for example, there are a number of search engines

specifically designed to help developers locate missing jars.

jarFinder5 and findJAR6 both allow the user to enter the

fully qualified name of a type and list every jar file in their

indices that contains that type. GrepCode7, too, supports

this functionality, in addition to letting users browse through

2 http://search.maven.org/
3 http://pypi.python.org/pypi
4 http://repo.refixcentral.com/
5 http://jarfinder.com/
6 http://www.findjar.com
7 http://grepcode.com

2012 19th Working Conference on Reverse Engineering

1095-1350/91 $25.00 © 4891 IEEE

DOI 10.1109/WCRE.2012.14

[Figure 1 depicts a repository tree: libraries at the top level, versions beneath each library, and jar artifacts beneath each version. Artifacts shown: junit-4.10.jar, junit-4.10-docs.jar; junit-4.9.jar, junit-4.9-dep.jar; log4j-1.2.9.jar, log4j-1.2.9-bin.jar; log4j-1.1.3.jar, log4j-1.1.3-src.jar.]

Figure 1. Example Artifact Repository Structure

the source code. Yet each of these search engines uses the

Maven Central Repository as the basis of their index. Any

additions are done manually, in the same way that artifacts

are added to Maven itself. Therefore if Maven does not

contain the library, then it will likely not be found.

This lack of coverage also potentially impacts tools that

use artifact repositories as a data source. These tools range

from systems for identifying artifact licensing information

[3], [5] to build systems that can automatically detect and

resolve missing dependencies [7]. Also, researchers looking

to study open source library ecosystems must aggregate them

themselves, lest they miss a significant number of libraries.

To address this coverage issue we developed Astra, an

algorithm for the automated bottom-up construction of

structured artifact repositories. Astra takes unknown library

artifacts, such as jar files, and returns a structured artifact

repository, as seen in Figure 1. While Astra can be applied to

any language which has a systematic method for packaging

reusable libraries, we focus specifically on Java.

Astra’s bottom-up construction is accomplished by analyzing the types declared in each jar file to discover patterns of co-occurrence. Astra was implemented for Java as part of the Sourcerer Infrastructure. We evaluated Astra’s effectiveness by comparing a repository it generated with the Maven Central Repository, a pre-existing manually curated Java artifact repository. We also generated a repository from a collection of unknown archive files crawled from the internet and compared this repository with the one built from the curated artifacts. The evaluation shows that our algorithm is effective at deriving structure from uncategorized input.

The remainder of this paper is structured as follows. In Section II, we present a case study of the Maven Central Repository, in which we evaluate its comprehensiveness. Section III introduces Astra, and Section IV presents our evaluation. After a discussion of related and future work in Sections V and VI, the paper concludes in Section VII.

II. CASE STUDY: MAVEN CENTRAL REPOSITORY

The Maven Central Repository is a large collection of

categorized Java artifacts, containing many popular third-

party Java libraries. Maven’s structure is essentially that seen

in Figure 1. Each library, represented as a green rectangle, is

uniquely identified by its groupId and artifactId, and

can be divided into multiple versions. Each version of the

library, represented as a red rectangle, can be associated with

an arbitrary number of artifacts, such as jar files, represented

as blue rectangles.

The Maven Central Repository was created to support the

Apache Maven build system, but is also regularly used as

a data source for jar search and other software tools. While

the Maven Central Repository is associated with Apache, it

contains a wide range of open source projects.

Although the Maven Central Repository is quite large, its coverage is not complete. This issue was first made apparent during our evaluation of a tool for automatically resolving missing dependencies, as the Maven-based repository repeatedly failed to find missing types [7].

We performed an analysis of the coverage of the Maven Central Repository relative to the missing types found in 31,047 Java projects downloaded from Apache and Google Code Hosting. In addition, we evaluated the coverage of a custom repository constructed from jar files found within those Java projects. This allows us to assess the difference in coverage between the Maven Central Repository and a structured repository generated by Astra.

Table I contains general statistics on the two repositories.

We eliminated identical jar files within each repository, so

that each unique jar is only counted once. As seen in the

table, the Maven repository contains nearly three times more

jar files than the custom repository. However, on average, the

jar files from the Maven repository contain 163 class files,

while jar files from the custom repository contain 246. This

difference is likely due to developers merging jar files for

use within their own projects, which are then captured in

the custom repository.

The custom repository also contains over one million more unique types than the Maven repository, despite having ten million fewer class files. This suggests that the custom repository has significantly better coverage. The ratio between the number of class files and the number of unique types provides an upper bound on the number of versions present of each type. For the Maven repository, this ratio is 17.1 class files per type, while for the custom repository the ratio is 6.2. So while the Maven repository contains fewer unique types than the custom repository, it provides more versions of each type it does contain.
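The two ratios quoted above can be reproduced directly from the counts in Table I. The short sketch below does only that arithmetic; the dictionary layout and the function name are illustrative choices, not part of the study.

```python
# Counts copied from Table I of the paper.
maven = {"class_files": 27_820_025, "unique_types": 1_626_288}
custom = {"class_files": 16_174_930, "unique_types": 2_596_396}


def class_files_per_type(repo):
    """Class files divided by unique type names.

    Every class file is one (possibly repeated) version of some type name,
    so this ratio bounds the average number of versions per type from above.
    """
    return repo["class_files"] / repo["unique_types"]


maven_ratio = round(class_files_per_type(maven), 1)
custom_ratio = round(class_files_per_type(custom), 1)
print(maven_ratio, custom_ratio)
```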


                   Maven         Custom
Jar Files          170,638       65,673
Class Files        27,820,025    16,174,930
Unique Types       1,626,288     2,596,396
Unique Packages    336,374       484,895

Table I. COMPARISON OF MAVEN AND CUSTOM REPOSITORIES

Package Name    Count
java.util       25,682
java.io         22,328
java.net        10,600
java.text       9,131
java.awt        8,670

Table II. TOP 5 MOST IMPORTED INTERNAL PACKAGES

Package Name          Count
javax.servlet.http    6,160
junit.framework       5,863
javax.servlet         5,199
org.junit             5,088
org.apache.log4j      3,406

Table III. TOP 5 MOST IMPORTED EXTERNAL PACKAGES

This initial comparison suggests that the custom repository is more comprehensive than the Maven repository. Yet we need to verify that the additional types the custom repository contains are actually used. If those types are not used, then the custom repository derives no benefit from containing them. In order to verify usage, each of the 31,047 Java projects was examined to identify import statements. The import statements were classified into two categories according to where the type declaration could be found.

• Internal import: An import that references either a

type declared within that project’s source code (source

code found within that project’s version repository) or

a type in the Java Standard Library.

• External import: An import that is not internal.
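The two categories above amount to a simple decision rule, sketched below. This is an illustration only: the function name is hypothetical, and treating every name that starts with `java.` as part of the Java Standard Library is a simplifying assumption (the paper does not say how standard library types were recognized).

```python
def classify_import(imported_fqn, project_declared_types, stdlib_prefixes=("java.",)):
    """Label one import as 'internal' or 'external' for a given project.

    project_declared_types: fully qualified names declared in the project's
    own source code. stdlib_prefixes is an assumed stand-in for a real list
    of Java Standard Library types.
    """
    if imported_fqn in project_declared_types:
        return "internal"  # declared in the project's own source
    if imported_fqn.startswith(stdlib_prefixes):
        return "internal"  # assumed to come from the standard library
    return "external"      # everything else must come from a library


declared = {"com.example.app.Main"}  # hypothetical project
print(classify_import("java.util.List", declared))
print(classify_import("org.junit.Test", declared))
```

Because the classification depends on the importing project, the same type can be internal for one project and external for another, which is why the counts that follow speak of types classified as external "according to at least one project".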

1,055,345 unique types were imported by the 31,047 Java projects. Of these unique imported types, 165,967 were classified as external according to at least one project. 23,645 projects (76.2%) imported at least one external type.

Tables II and III contain the top 5 most imported internal and external packages. The count column shows the number of projects importing a type in that package.

To evaluate the repositories’ coverage, we cross-referenced the identified external types with the type declarations found in each repository. Table IV presents the results of this comparison. In total, there were 165,967 external types. The first two rows show that 66,249 (39.9%) of those types could be found in the Maven repository and 118,783 (71.6%) in the custom repository. This large difference confirms our initial findings that the custom repository has significantly better coverage of external types. There is substantial overlap between the Maven and custom repositories, as seen by the third row being significantly less than the sum of the first two. The fourth row contains the number of types that could not be found in either corpus.
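The cross-referencing described above is a set intersection between the external type names and each repository's declared type names. The sketch below shows that computation on toy data; the function name, the repository labels, and the example type names are all illustrative assumptions.

```python
def coverage_report(external_types, repositories):
    """Count how many external types each repository declares.

    external_types: set of fully qualified type names.
    repositories: dict mapping a repository label to its set of declared types.
    Returns per-repository counts plus 'union' and 'nowhere' counts.
    """
    report = {}
    covered_anywhere = set()
    for name, declared in repositories.items():
        found = external_types & declared
        report[name] = len(found)
        covered_anywhere |= found
    report["union"] = len(covered_anywhere)
    report["nowhere"] = len(external_types - covered_anywhere)
    return report


# Toy data, not taken from the study.
external = {"a.A", "b.B", "c.C", "d.D"}
repos = {"maven": {"a.A", "x.X"}, "custom": {"a.A", "b.B", "c.C"}}
report = coverage_report(external, repos)
print(report)
```

When the union count is well below the sum of the per-repository counts, as in this toy case, the repositories overlap heavily.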

Source            Type Count    Percentage
Maven             66,249        39.9%
Custom            118,783       71.6%
Maven ∪ Custom    123,963       74.7%
Nowhere           42,031        25.3%
Total             165,967       100%

Table IV. COVERAGE OF IMPORTED EXTERNAL TYPES

Source             Type Count    Percentage
Maven              10,145        85.6%
Project            11,801        99.6%
Maven ∪ Project    11,807        99.7%
Nowhere            38            0.3%
Total              11,845        100%

Table V. COVERAGE OF POPULAR IMPORTED EXTERNAL TYPES

Comparing the number of unique types in Table I with the

total number of imported types in Table IV shows that the

vast majority of types never get imported. 4.1% of the types

in the Maven repository were imported at least once versus

4.6% of the types in the custom repository. Looking at the

overall coverage of jars, we found that 31.9% of Maven jars

and 75.4% of project jars had at least one type imported.

We were concerned that the advantage of the custom repository was due to related projects referencing one another, and therefore creating the appearance of missing libraries. Such inter-project imports would show up as external references, yet would not be used anywhere else. We therefore filtered the external types by popularity, including only those types that were imported by 10 or more projects.

Table V contains the results. There were a total of 11,845 types that were imported by 10 or more projects. The first two rows show that 10,145 (85.6%) of those types could be found in the Maven repository and 11,801 (99.6%) in the custom repository. These results further confirm the difference in coverage, though the difference is less pronounced for popular libraries. As expected, the filtered coverage is much better than the coverage reported in Table IV, with a smaller fraction of types that could not be found. Also, the overlap between the two repositories is even greater, with only 6 types being unique to the Maven repository.


III. ASTRA

This section describes Astra, an Automated STructured Repository Algorithm. Astra’s goal is to construct a structured artifact repository, as shown in Figure 1, out of a collection of unknown artifacts. We present Astra with respect to the Java programming language. However, it can be applied to any programming language where packaged reusable libraries are used. Astra was implemented as part of Sourcerer, an infrastructure for the large-scale indexing and analysis of open source Java code. The implementation can be found on Sourcerer’s GitHub page8.

The key insight behind Astra is that a library is not

composed of an exact and unchanging set of declared types.

Types will appear and disappear between versions, and

different distributions of the same version will often contain

different types. Despite this flux, however, there will usually

be an unchanging core of types. By grouping together types

that always co-occur, we can identify the unchanging cores

of libraries. These core clusters can then be expanded using

type versions to predict the inclusion of additional types.

Ultimately, the artifacts can be assigned to libraries based

on the patterns of type clusters that each artifact file contains.

Figure 2 shows the step-by-step process by which Astra constructs an artifact repository. The input is a collection of unknown artifacts, and the output a structured artifact repository, as seen in Figure 1. The remainder of this section covers each step in detail. Statistics on the intermediate results produced by each step are presented using data from a Maven-derived artifact collection, in which the 170,638 jar files from Maven were fed into Astra.

A. Model the Artifacts

Astra’s first step is to process and model the unknown

artifacts. Each artifact is modeled as a collection of declared

types, one for each class file in the jar, as seen in Part B

of Figure 2. This is done so that the types associated with

each artifact can be clustered based on their co-occurrence.

In Astra’s model, every declared type is identified by its fully qualified name and its version. For a top-level type, a fully qualified name (FQN) is the name of the type’s package concatenated with its simple name. For example, junit-4.10.jar contains a version of the type org.junit.runner.Runner while junit-4.4.jar defines a different version of the same type.

In building Astra, we experimented with a number of different methods for determining type versions. Our first approach was to directly compare the bytecode of two class files. Unfortunately we found that this resulted in a number of false negatives, where two identical versions of the same declared type would be identified as different. Further investigation implicated differences between Java compiler options and implementations as the root cause. Including debug information, for example, alters the generated bytecode.

8 http://www.github.com/sourcerer/Sourcerer

In order to make our version detection compiler independent, we turned to a fingerprinting approach, similar to the one used by Davies et al. [3]. Using a bytecode reader, the constant pool for each class file is processed and a hash is generated. Specifically, a class’ superclass, interfaces, fields, method signatures and referenced types are collected, sorted, and then hashed. Synthetic fields and methods are discarded, as their generation is compiler implementation-specific.
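The collect, sort, and hash procedure can be sketched as follows. This is a minimal illustration, not Astra's implementation: it assumes a bytecode reader has already extracted the relevant names into the hypothetical `ClassModel` and `Member` structures, and the choice of SHA-256 and of the text encoding is ours, since the paper does not name a hash function.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Member:
    """A field or method signature (hypothetical structure)."""
    signature: str
    synthetic: bool = False


@dataclass
class ClassModel:
    """Data assumed to have been read from one class file."""
    superclass: str
    interfaces: list
    fields: list            # list of Member
    methods: list           # list of Member
    referenced_types: list  # names of types referenced by the class


def fingerprint(model):
    """Hash the sorted structural elements, skipping synthetic members."""
    items = [f"super {model.superclass}"]
    items += [f"iface {name}" for name in model.interfaces]
    items += [f"field {m.signature}" for m in model.fields if not m.synthetic]
    items += [f"method {m.signature}" for m in model.methods if not m.synthetic]
    items += [f"ref {name}" for name in model.referenced_types]
    items.sort()  # make the result independent of the order in the class file
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()
```

Two class files that differ only in element order or in compiler-generated synthetic members receive the same fingerprint under this scheme, while a changed method signature yields a different one.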

While this fingerprinting approach dramatically reduces false negatives, it does introduce false positives. The fingerprint cannot detect a modification to a method body that does not alter which external types/methods are referenced. Thus subtle changes between versions will not be detected, which can cause two slightly different versions of a library to be conflated. This is, however, vastly preferable to the improper fragmentation that occurs with false negatives, and in practice does not occur very often between major releases, as for two library versions to be conflated, every change across the entire library would have to be undetectable.

The 170,638 jar files from the Maven-derived collection contain a total of 27,820,025 class files and 1,626,288 unique declared type names.

B. Generate Preliminary Clusters

After the unknown artifacts are modeled, Astra’s next step is to generate a preliminary clustering of types, as seen in Part C of Figure 2. The preliminary clusters are defined such that for any two types (ignoring versions), if those types always co-occur in jar files, then they are in the same cluster. For example, in the Maven repository, org.eclipse.jetty.http.HttpCookie is present in a jar file if and only if org.eclipse.jetty.http.HttpContent is also present. Those two types are therefore assigned to the same preliminary cluster. There exists exactly one preliminary clustering, due to the transitivity of co-occurrence. Each type is assigned to exactly one cluster.
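Two types always co-occur exactly when the set of jar files containing one equals the set containing the other, so the preliminary clustering can be computed by grouping type names on that set. The sketch below illustrates this under that reading; the function name and the jar labels are hypothetical, and the toy jar contents are not taken from the Maven data.

```python
from collections import defaultdict


def preliminary_clusters(jars):
    """Group type names that appear in exactly the same set of jars.

    jars: dict mapping a jar label to the set of fully qualified type names
    it declares (versions are ignored at this stage).
    Returns a list of frozensets, one per cluster.
    """
    occurrences = defaultdict(set)  # type name -> jars containing it
    for jar, type_names in jars.items():
        for fqn in type_names:
            occurrences[fqn].add(jar)

    groups = defaultdict(set)       # identical jar set -> type names
    for fqn, jar_set in occurrences.items():
        groups[frozenset(jar_set)].add(fqn)
    return [frozenset(names) for names in groups.values()]


# Toy input with hypothetical jar labels.
jars = {
    "jar1": {"junit.framework.Assert"},
    "jar2": {"junit.framework.Assert", "org.junit.After", "org.junit.Before"},
}
clusters = preliminary_clusters(jars)
print(clusters)
```

Grouping on the jar set gives a single, well-defined partition, which matches the observation that exactly one preliminary clustering exists.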

For this stage, each type is identified only by its fully qualified name; the version of each type is ignored so that different versions of the same library will be grouped together. One weakness of this approach to identification is that the clustering is very sensitive to renaming. For example, packages are often renamed when one library wishes to include a specific version of another library as a dependency and wants to avoid naming conflicts. We intend to explore methods for resolving such renaming in future work.

The 1,626,288 declared types from the Maven-derived collection were grouped into 77,236 preliminary clusters.

C. Merging Clusters using Versions

Once the preliminary clusters are generated, Astra merges

similar clusters together using the type version fingerprints to

Figure 2. Astra Workflow

                          Type Versions
Cluster Types             Jar 1    Jar 2    Jar 3
jetty.http.HttpCookie     1        2        2
jetty.http.HttpContent    1        1        2

Table VI. CLUSTER VERSION EXAMPLE

generate a final clustering as seen in Part D of Figure 2. The goal is to merge together clusters that contain types from different versions of the same library. For example, while both junit.framework.Assert and org.junit.After belong to the JUnit library, Assert is present in all versions of JUnit, while After only shows up after version 4.0. This results in the preliminary clustering placing Assert and After in separate clusters. This step corrects that division, and merges those two clusters together. In the resulting cluster, those types that are always present are termed core types, while those that are only present sometimes are termed version types.

For this step, Astra divides each cluster into a set of

cluster versions. Each cluster contains at least one cluster

version, one for each valid combination of versions of its

types. A combination of type versions is considered valid if

that combination is found in at least one jar file.

To illustrate cluster versions, consider the cluster from Table VI. This simplified cluster contains two types, org.eclipse.jetty.http.HttpCookie and org.eclipse.jetty.http.HttpContent, which are found in three jar files. Each type has two possible versions, and the version of each type present in a specific jar file is represented by the number in the corresponding cell. This cluster has three versions, one for each of the combinations present in a jar file. Note that jetty.http.HttpCookie version 1 and jetty.http.HttpContent version 2 is not a valid cluster version, as that combination is not found in a jar file. A cluster can have at most one version per matching jar file, but in practice the number is much lower. The true org.eclipse.jetty.util cluster, for example, has 34 versions despite appearing in 402 archive files.
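The cluster versions of Table VI can be enumerated by collecting the distinct combinations of type versions that actually occur in some jar file. The sketch below does this for the table's data; the function name and the dictionary representation are illustrative assumptions.

```python
COOKIE = "jetty.http.HttpCookie"
CONTENT = "jetty.http.HttpContent"


def cluster_versions(cluster_types, jars):
    """Return the distinct combinations of type versions seen in jar files.

    cluster_types: set of type names in the cluster.
    jars: dict mapping a jar label to a dict {type name: version id}.
    Each combination is a frozenset of (type name, version id) pairs.
    """
    versions = set()
    for type_versions in jars.values():
        # Only jars that contain every type of the cluster are considered.
        if all(name in type_versions for name in cluster_types):
            versions.add(frozenset((name, type_versions[name]) for name in cluster_types))
    return versions


# Data from Table VI.
table_vi = {
    "Jar 1": {COOKIE: 1, CONTENT: 1},
    "Jar 2": {COOKIE: 2, CONTENT: 1},
    "Jar 3": {COOKIE: 2, CONTENT: 2},
}
versions = cluster_versions({COOKIE, CONTENT}, table_vi)
print(len(versions))
```

Because combinations are stored in a set, any number of jar files with the same combination contribute a single cluster version, which is why the count can be far lower than the number of jar files.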

Astra examines the preliminary clusters in order of popularity. A cluster’s popularity is defined as the number of jar files containing that cluster. This ordering is used because a library’s core types must occur more often than its version types. By examining the most popular clusters first, we guarantee that we will encounter the core types in a library before its version types.

When a cluster is examined, Astra looks at every less

popular cluster to determine if it should be merged into the

cluster being examined. A cluster is merged if it meets the

following three criteria:

1) There exists at least one version of the core cluster for

which the candidate cluster is always present.

2) There exists no version of the core cluster for which

the candidate cluster is only sometimes present.

3) If the candidate cluster is present, the core cluster is

also present.

The first criterion is the most important, and identifies candidate clusters that co-occur with at least one version of the core cluster. The second criterion prevents two libraries that are commonly packaged together from being inappropriately merged. The final criterion prevents a popular library from improperly subsuming a less popular library.
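One way to read the three criteria is as a test over the jar files grouped by core cluster version. The sketch below follows that reading and is an interpretation, not Astra's code: the function name and the input representation (a map from jar to core cluster version, plus the set of jars containing the candidate) are assumptions made for illustration.

```python
from collections import defaultdict


def should_merge(core_version_by_jar, candidate_jars):
    """Apply the three merge criteria to one candidate cluster.

    core_version_by_jar: dict mapping each jar that contains the core
    cluster to the id of the core cluster version found in it.
    candidate_jars: set of jars that contain the candidate cluster.
    """
    jars_by_version = defaultdict(set)
    for jar, version in core_version_by_jar.items():
        jars_by_version[version].add(jar)

    always_with_some_version = False
    for jars in jars_by_version.values():
        present = jars & candidate_jars
        if present == jars:
            always_with_some_version = True  # criterion 1 holds for this version
        elif present:
            return False                     # criterion 2: only sometimes present

    # Criterion 3: the candidate never appears without the core cluster.
    candidate_within_core = candidate_jars <= set(core_version_by_jar)
    return always_with_some_version and candidate_within_core


# Hypothetical example: the core cluster occurs in three jars, two versions.
core = {"j1": "v1", "j2": "v2", "j3": "v2"}
print(should_merge(core, {"j2", "j3"}))        # always present with v2
print(should_merge(core, {"j2"}))              # only sometimes present with v2
print(should_merge(core, {"j2", "j3", "j4"}))  # also occurs without the core
```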

If a candidate cluster is merged into a core cluster, the

candidate cluster is removed from the list of clusters. If the

core cluster is expanded, the matching process is repeated

on the now expanded cluster until no new clusters are

added. This accounts for the possibility that the types that

distinguish two versions of a library are not present in the

initial core, and are only added in one of the merging steps.

When this step is applied to the 77,236 preliminary clusters, the number of clusters is reduced to 22,564.

D. Identifying Libraries

Astra’s final step is to identify the libraries themselves.

The goal is to use the clusters to group together jars that

represent the same library. The clusters from the previous

step do not directly correspond to libraries, as libraries are

often fragmented into multiple clusters due to dependencies

and inconsistent packaging.

For example, recent versions of JUnit are packaged with

Hamcrest, a matcher library. The JUnit library is therefore

the combination of the cluster of the core JUnit types

with the cluster of Hamcrest types. Clusters are further

fragmented because libraries regularly have unused types

stripped away to save on space. Sometimes, for instance, a

developer is only interested in using a fraction of Hamcrest,

and so only includes a few of its types. This results in

Hamcrest, a singular library, being fragmented into multiple

clusters depending on how developers commonly divide it.

Library identification is split into two phases, and is

centered on the jar files. Primary libraries are identified first,

where Astra attempts to match each jar file to a cluster.

Compound libraries are then created around any jar files

that could not be matched.

Primary Libraries: The structure of a primary library

can be seen in Part E of Figure 2. Each primary library

contains a single core cluster, and some number of version

clusters. Primary library identification proceeds similarly to

the cluster merging described in the previous step.

The clusters are again sorted according to their popularity

and examined one by one. Each cluster that is examined is

considered the core of a new library. Then, Astra looks for

clusters to be added as version clusters. A cluster is added

if it meets the following two criteria:

1) There exists at least one version of the core cluster for

which the candidate cluster is always present.

2) There exists no version of the core cluster for which

the candidate cluster is only sometimes present.

By dropping the third criterion, which was designed to pre-

vent situations where a JUnit-like cluster would be merged

with a Hamcrest-like cluster, Astra will now group together

the JUnit and Hamcrest clusters to form the JUnit library.

While a type with a given fully qualified name may only

occur in a single cluster, it can be in as many libraries as

appropriate.

Jar files are assigned to a primary library if they contain

the core cluster, any subset of the additional clusters, and no

clusters that are not part of the library. Given this definition,

there is exactly one primary library per cluster. However,

not all of these libraries have jar files associated with them.

Imagine, for example, two libraries A and B which both

depend on library C. The corpus contains jar files for A and B, but none for C. The algorithm will identify C as a

primary library, but there will not be any archive file that

directly corresponds to it. We term these libraries phantom

libraries.
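
The assignment rule and the phantom library example can be sketched as follows. This is a hedged illustration under stated assumptions. Each jar is reduced to the set of clusters it contains, and a jar that could match several libraries goes to the first one in popularity order; the tie-breaking rule is an assumption, not something stated in the text. The function and variable names are hypothetical.

```python
def assign_jars(libraries, jar_clusters):
    """Assign jars to primary libraries and report phantom libraries.

    libraries: {core cluster: set of additional clusters}, most popular first.
    jar_clusters: {jar name: set of clusters contained in the jar}.
    """
    assigned = {core: [] for core in libraries}
    unmatched = []
    for jar, clusters in jar_clusters.items():
        for core, extras in libraries.items():
            # The jar must hold the core and nothing outside the library.
            if core in clusters and clusters <= {core} | extras:
                assigned[core].append(jar)
                break  # assumption: first (most popular) match wins
        else:
            unmatched.append(jar)
    phantoms = [core for core, jars in assigned.items() if not jars]
    return assigned, unmatched, phantoms


libraries = {"C": set(), "A": {"C"}, "B": {"C"}}
jar_clusters = {
    "a.jar": {"A", "C"},
    "b.jar": {"B", "C"},
    "bundle.jar": {"A", "B", "C"},
}
assigned, unmatched, phantoms = assign_jars(libraries, jar_clusters)
print(assigned)   # {'C': [], 'A': ['a.jar'], 'B': ['b.jar']}
print(unmatched)  # ['bundle.jar']
print(phantoms)   # ['C']
```

Library C receives no jar because every jar containing its cluster also contains clusters outside it, which makes C a phantom library. The jar that bundles A, B, and C matches no primary library and is left for the compound phase.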

Of the 22,564 primary libraries identified in the Maven

collection, 19,296 of them matched at least one jar file.

Conversely, of the 170,628 jar files, 135,207 matched a

primary library and 35,421 did not.

Compound Libraries: The jar files that could not be

assigned to any primary libraries are instead divided into

compound libraries, as described in Part F of Figure 2.

While a primary library is centered on a core cluster, a

compound library is instead a set of clusters that regularly

appear together. The compound libraries are identified by

iterating through the remaining archive files and assigning

them one by one to a library based on their clusters.

When applied to the Maven-based corpus, the 35,421 remaining files were divided into 4,324 compound

libraries, giving 26,888 total libraries.
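
The text says only that the remaining jars are assigned to compound libraries "based on their clusters". One simple reading, used here purely as an assumption, is that jars with exactly the same set of clusters form one compound library. The sketch below illustrates that reading; `build_compound_libraries` and the jar names are hypothetical.

```python
def build_compound_libraries(unmatched_jars):
    """Group leftover jars by the exact set of clusters they contain.

    unmatched_jars: {jar name: set of clusters contained in the jar}.
    Exact-set grouping is an assumed rule, chosen for illustration.
    """
    compound = {}
    for jar, clusters in unmatched_jars.items():
        compound.setdefault(frozenset(clusters), []).append(jar)
    return compound


unmatched_jars = {
    "bundle-1.jar": {"A", "B", "C"},
    "bundle-2.jar": {"A", "B", "C"},
    "bundle-lite.jar": {"A", "B"},
}
compound = build_compound_libraries(unmatched_jars)
for clusters, jars in compound.items():
    print(sorted(clusters), jars)
```

Under this rule, a jar that differs by even one cluster lands in a separate compound library.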

Library Versions: Both primary and compound libraries

have their jar files divided into versions. Two jar files

are assigned to the same version if they contain the identical

set of types, according to the version fingerprinting process

described earlier. The 26,888 libraries in the Maven-based

corpus were broken into 77,091 versions.
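
Version assignment amounts to grouping jars by an exact key. In the minimal sketch below, each jar is assumed to be a mapping from fully qualified type name to an opaque fingerprint string produced by the fingerprinting step. The function name and sample fingerprints are hypothetical.

```python
from collections import defaultdict


def split_into_versions(jar_types):
    """Group a library's jars so jars with identical type sets share a version.

    jar_types: {jar name: {fully qualified type name: fingerprint}}.
    """
    versions = defaultdict(list)
    for jar, types in jar_types.items():
        # Two jars get the same key only if every type and fingerprint agree.
        versions[frozenset(types.items())].append(jar)
    return list(versions.values())


jar_types = {
    "junit-4.9.jar": {"org.junit.Assert": "f1", "org.junit.Test": "f2"},
    "junit-4.9-copy.jar": {"org.junit.Assert": "f1", "org.junit.Test": "f2"},
    "junit-4.10.jar": {"org.junit.Assert": "f3", "org.junit.Test": "f2"},
}
print(split_into_versions(jar_types))
# [['junit-4.9.jar', 'junit-4.9-copy.jar'], ['junit-4.10.jar']]
```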

IV. EVALUATION

To verify Astra’s effectiveness at building a structured

artifact repository, we compared the structure of a manually

curated Java artifact repository with a repository Astra

generated from its unstructured contents. Astra processed

the 170,628 jar files from the Maven Central Repository.

We then computed the similarity between the structure of

the original Maven repository and the structure of the generated

repository. Additionally, we manually examined 100 cases

where the two structures differed to identify the cause of the

differences.

Table VII. GENERATED REPOSITORY STATISTICS

                              Maven        Custom
Jar Count                     170,628      65,673
Unique Type Count             1,626,288    2,596,452
Preliminary Cluster Count     72,236       63,866
Merged Cluster Count          22,564       23,739
Cluster Version Count         88,880       70,484
Primary Library Count         22,564       23,739
Phantom Library Count         3,268        3,278
Compound Library Count        4,324        3,389
Library Count                 26,888       27,128
Non-Empty Library Count       23,620       23,850
Library Version Count         77,091       49,400

We also wanted to verify that the algorithm would gen-

erate a reasonable structure if it was not given high quality

input. The jar files comprising the Maven Central Repository

are all discretely packaged versions of popular libraries.

There is no such guarantee for jar files collected at large.

We therefore fed our algorithm the 65,673 jar files found

in the 31,047 projects, as described in Section II. We then manually

compared the resulting repository with the structure of the

other generated repository with respect to 20 popular Java

libraries.

A. Generated Repositories

Table VII contains some general statistics on two reposito-

ries generated by our algorithm. The Maven-based repository

was generated from the jar files contained in the Maven Cen-

tral Repository, while the custom repository was generated

from the jar files found in the Sourcerer Repository. Given

the differences between the underlying jar file collections,

as described in Section II, it was expected that the project-

based repository would contain significantly more libraries,

and relatively fewer versions of each library. This is borne

out by the data.

B. Evaluation of Maven-based Repository Structure

This section presents the comparison of the structure of

the Maven Central Repository with a repository generated

by Astra from its jar files. As discussed in Section II,

individual artifacts in the Maven Central Repository are

uniquely identified by a groupId, artifactId and version.

Artifacts that share both a groupId and artifactId are versions

of the same library, and we will refer to them as Maven

libraries. The goal of our comparison was to see how closely

the Maven libraries corresponded to the generated libraries.

As seen in Table VII, there were a total of 26,888 generated

libraries. By contrast, there are 23,238 total Maven libraries.

Figure 3. Categorization of Generated Libraries According to Matching Maven Libraries

Each generated library was matched with its corresponding

Maven libraries. A generated library matches a Maven

library if they share at least one jar file. There are four

possibilities when matching a generated library with Maven

libraries.

• Perfect Match: They match perfectly. Every jar file in

the generated library is also in the Maven library, and

vice versa.

• Fragmented: The generated library is a subset of

the Maven library. This means the Maven library has

been fragmented into multiple separate libraries by our

algorithm.

• Combined: The generated library contains a superset

of the Maven library. This means that multiple Maven

libraries have been combined into a single library by

our algorithm.

• Fragmented & Combined: The generated library contains

a fraction of multiple Maven libraries. This is a

combination of the previous two situations, where the

generated library has both combined multiple Maven

libraries and fragmented them.
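
The four categories can be expressed with set comparisons over jar names. The sketch below is an illustration, not the evaluation code. It assumes that every jar belongs to exactly one Maven library, as in the Maven-based corpus; the function name and sample jars are hypothetical.

```python
def categorize(generated, maven_libraries):
    """Classify a generated library against the Maven libraries it matches.

    generated: set of jar names in the generated library.
    maven_libraries: {library name: set of jar names}; every jar is assumed
    to belong to exactly one Maven library.
    """
    # A Maven library matches if it shares at least one jar.
    matches = [jars for jars in maven_libraries.values() if generated & jars]
    if not matches:
        raise ValueError("generated library shares no jar with any Maven library")
    if len(matches) == 1:
        return "perfect match" if generated == matches[0] else "fragmented"
    # Several Maven libraries match: were all of them absorbed completely?
    if all(jars <= generated for jars in matches):
        return "combined"
    return "fragmented & combined"


maven = {
    "junit": {"junit-4.9.jar", "junit-4.10.jar"},
    "hamcrest": {"hamcrest-1.1.jar", "hamcrest-1.3.jar"},
}
print(categorize({"junit-4.9.jar", "junit-4.10.jar"}, maven))    # perfect match
print(categorize({"junit-4.9.jar"}, maven))                      # fragmented
print(categorize(maven["junit"] | maven["hamcrest"], maven))     # combined
print(categorize({"junit-4.9.jar", "hamcrest-1.1.jar"}, maven))  # fragmented & combined
```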

Figure 3 contains the results of assigning each generated

library to one of those four categories. As seen in the first

row, there is complete agreement between the two reposi-

tories in slightly under two thirds of cases. The remaining

third of cases is primarily made up of generated libraries

that fragment Maven libraries.

In addition to categorizing the generated libraries, we

quantified how closely the two repositories matched. For

this, we computed a Jaccard similarity coefficient for each

combination of generated library and matching Maven li-

brary. A Jaccard coefficient compares two sets by taking

the size of their intersection and dividing it by the size

of their union. For comparing libraries, this corresponds to

comparing the set of jars contained by the generated library

with the set of jars contained by the Maven library. For

a generated library that matches multiple Maven libraries,

we computed the Jaccard coefficients separately and then

averaged them.

Figure 4. Average Jaccard Similarity Coefficients by Category

For example, imagine that Astra generated a library for

JUnit that contained 7 JUnit jars plus one additional jar.

In the Maven repository, those 7 JUnit jars all belonged to

the same Maven library, and that additional jar belonged

to a library with one other jar. In this situation, we would

compute two Jaccard coefficients; one comparing the gen-

erated library with the Maven JUnit library, and the other

comparing the generated library with the additional Maven

library. The first coefficient would be equal to 7/8, as the

two libraries contained 7 shared artifacts and 8 total unique

artifacts. The second coefficient would be equal to 1/9, as they

shared 1 artifact and contained a total of 9 unique artifacts.

The overall score for the generated library would then be

the average of 7/8 and 1/9.
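
The JUnit example can be reproduced with exact fractions. This is a small sketch of the calculation as described in the text. The jar names are placeholders, and `library_score` is a hypothetical helper name.

```python
from fractions import Fraction


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return Fraction(len(a & b), len(a | b))


def library_score(generated, maven_libraries):
    """Average the Jaccard coefficient over every matching Maven library."""
    matching = [m for m in maven_libraries if generated & m]
    return sum((jaccard(generated, m) for m in matching), Fraction(0)) / len(matching)


maven_junit = {f"junit-{i}.jar" for i in range(1, 8)}  # the 7 JUnit jars
maven_other = {"extra.jar", "other.jar"}               # the additional library
generated = maven_junit | {"extra.jar"}                # 7 JUnit jars plus one

print(jaccard(generated, maven_junit))  # 7/8
print(jaccard(generated, maven_other))  # 1/9
score = library_score(generated, [maven_junit, maven_other])
print(score, float(score))              # 71/144, roughly 0.49
```

The overall score for this generated library is therefore 71/144, or about 0.49.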

Figure 4 contains the results of the similarity calculations.

The error bars show one standard deviation. The first column

contains the average Jaccard similarity coefficient for

every generated library, which was 0.77. The remaining

columns show the average Jaccard similarity coefficient

when only the generated libraries classified in the specified category

are counted. We can see that the libraries in the combined

category in general have a much higher similarity coefficient

than those in either other category.

Finally, we examined a selection of cases where the struc-

ture of our generated repository differed from the structure

of the Maven Central Repository. Our goal was to identify

the reasons why some generated libraries failed to perfectly

match their Maven counterparts, and to categorize the cause

of these failures. We looked at 100 libraries that were not

classified as a perfect match.

As we saw earlier, there are three categories of failure:

fragmented libraries, combined libraries, and libraries that

are both fragmented and combined. Of the 100 libraries

we examined, 65 of them were fragmented, 24 of them

combined and 11 of them both fragmented and combined.

We discovered, perhaps not surprisingly, that in cases where

libraries were both fragmented and combined, the cause was

a combination of two separate causes.

The following are the causes for library fragmentation that

we identified.

• Addition (35 cases): Two versions of a Maven library

contain the identical core of types, down to the exact

type versions. The later library version includes some

additional types. The later library version gets frag-

mented into a new library, because the versions of the

core files do not predict the additional files.

• Compound Addition (24 cases): No single cluster pre-

dicts the other clusters, so no primary library matches

the Maven library. Instead, the same Maven library is

fragmented over multiple compound libraries because

different versions of the Maven library match slightly different

sets of clusters.

• Package Rename (10 cases): The later version of a

Maven library has changed the package name prefix

used in earlier versions.

• Dramatic Change (5 cases): Different versions of the

same Maven library are dramatically different. This is often

seen between an alpha release and a full release. Changes

include vastly more files, package renaming, and altered

files.

• Packaging Error (2 cases): Some of the versions of

the Maven library are mislabeled and actually contain

a different library.

The first two causes of failure are due to limitations in

the algorithm, while the remaining three are instead due

to inconsistencies in the underlying data. We believe that

many instances of addition and compound addition can be

corrected by improving the algorithm, and will discuss that

in more detail in Section VI. We also plan to augment the

algorithm to identify cases where package renames have

occurred. For the remaining two causes of failure, we believe

the algorithm acted appropriately. So while the generated

classification disagrees with the one present in the Maven

repository, we prefer the generated one.

The following are the causes for library combination that

we identified.

• Unique Version (10 cases): One Maven library in-

cludes a different Maven library. The version of the

second library used by the first library is not found

elsewhere in the repository, which causes Astra to

believe that the first library is a version of the second

library. The result is that the two are combined.

• Group Rename (9 cases): At some point a project

changed its group name, causing the earlier versions

to be found in one Maven library and later versions to

be found in another. The algorithm combined the two

Maven libraries.

• Duplication (8 cases): Two different Maven libraries

contain the exact same contents.

• Maven Error (8 cases): Incorrect use of Maven categorization,

such as putting the version in the artifactId.

The first cause of failure is due to a limitation in the

algorithm, while the remaining three are again due to in-

consistencies in the underlying data. We believe that many

instances of unique version failure can be corrected by

augmenting the algorithm further, and will discuss that in

more detail in Section VI. For the remaining three causes

of failure, we believe the algorithm acted appropriately. So

while the generated classification disagrees with the one

present in the Maven repository, we prefer the generated

one.

C. Evaluation of Custom Repository Structure

To verify that Astra can generate a reasonable structure

when given lower quality input, we fed our algorithm the

65,673 jar files from the custom corpus described in Section

II. We then manually examined the resulting repository with

respect to 20 popular Java libraries, to see if the jar files

corresponding to each popular library were grouped together

in a reasonable manner. Given the difficulty in identifying

a correct grouping, and hence the subjective nature of this

evaluation, we will describe our findings qualitatively.

Upon examination, it became clear that the quality of

the input jar files impacted Astra’s ability to generate a

repository structure. However, while the structure is not quite

as good as the Maven-based repository, it is reasonable. For

each of the popular libraries, Astra was able to identify a

handful of generated libraries that matched our conception

of the popular library.

One noticeable difference between the repositories was

in the distribution of popular libraries between jar files.

In the custom repository, the popular libraries are found

in relatively more jar files. For example, the JUnit type

org.junit.runner.Runner is in 192 jars in the cus-

tom repository and only 145 jars in the Maven-based reposi-

tory, despite the Maven-based repository having significantly

more jar files. This trend was repeated across all of the top

libraries. This is likely due to developers packaging their

library dependencies together into a single jar file, which

causes popular libraries to be overrepresented.

Lastly, we found that the impact of the compound addition

failure cause is more greatly felt with lower quality input.

In the custom repository, a larger proportion of the popular

libraries were fragmented into multiple generated libraries

than in the Maven-based repository. We will focus on

addressing this cause of failure.

D. Threats to Validity

There are a number of threats to the validity of our

evaluation. First, we focused entirely on Java. While we

believe that Astra can be successfully applied to other

programming languages, that may not in fact be the case.

Second, our dataset was limited to the fraction of open

source Java found in the Maven Central Repository and in

our custom repository. It’s possible that these results would

not generalize to the Eclipse Java ecosystem, for exam-

ple. Lastly, our evaluation of the project-based repository

structure was entirely subjective. While it suggests that a

repository generated from input of unknown quality can be

usable, a more comprehensive evaluation using a controlled

study would be necessary to verify this.

V. RELATED WORK

There is a large body of work focused on the identifi-

cation and organization of software components [11], and

software component repositories bear many similarities to

the structured artifact repositories discussed in this paper.

However, the focus of component identification research has

been on identifying fragments of systems that can easily be

extracted as components and reused [2]. In contrast, Astra

attempts to identify reusable components by examining a

large collection of artifacts derived from multiple projects.

While the result is similar, the approach is quite different.

Research in clone detection has also informed the creation

of Astra. For clone detection systems, the goal is to identify

similar units of code [1], [9]. Yet Astra’s goal is to use clone

information to identify libraries. Astra’s method for detect-

ing type versions is similar to one the authors previously

used to detect file-level cloning [8], and to techniques used

in identifying jar licensing [3], [4], [5].

VI. FUTURE WORK

The most pressing area for future work is addressing the failure causes

identified in our evaluation. We believe that some instances

of addition can be corrected by improving the fidelity of our

type fingerprint, which will allow us to distinguish versions

better. However, there are likely cases where the versions of

the core files cannot predict the inclusion of new types, in

which case a fundamental addition to the algorithm will be

necessary. In order to handle compound libraries better, we

need to build in tolerance for variable clusters. Additionally,

a clone detection-based approach could identify package

renames. Lastly, a name-based heuristic would likely help

handle the failures where unique versions cause libraries

to be improperly combined.

Another area for future work is the inclusion of call-graph

analysis in the clustering process. We can improve the type

clustering by identifying self-referencing subgraphs.

To improve the usability of the generated repository,

we plan on exploring methods for assigning names to the

libraries.

VII. CONCLUSION

In this paper, we presented Astra, an algorithm for

the automated bottom-up construction of structured artifact

repositories. We demonstrated the need for Astra, as existing

artifact repositories do not sufficiently cover the libraries

used by developers in practice.

Astra builds a structured artifact repository from any

collection of unknown library artifacts, such as jar files.

This bottom-up construction is accomplished by analyzing

the co-occurrence of types between the library artifacts. Our

evaluation demonstrated that Astra generates a repository

with 77% similarity to the Maven Central Repository when

provided the same base artifacts. An examination of the

differences between the manually curated and generated

repositories indicated that in many cases the generated

structure has significant merit.

ACKNOWLEDGMENT

This material is based upon work supported by the Na-

tional Science Foundation under Grant No. 1018374.

REFERENCES

[1] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and Evaluation of Clone Detection Tools. IEEE Transactions on Software Engineering, 33(9):577–591, Sept. 2007.

[2] G. Caldiera and V. R. Basili. Identifying and qualifying reusable software components. Computer, 24(2):61–70, 1991.

[3] J. Davies, D. German, M. Godfrey, and A. Hindle. Software Bertillonage: Finding the provenance of an entity. In Proc. of the 2011 Working Conference of Mining Software Repositories. Citeseer, 2011.

[4] M. Di Penta, D. M. German, and G. Antoniol. Identifying licensing of jar archives using a code-search approach. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pages 151–160. IEEE, May 2010.

[5] A. Hemel, K. T. Kalleberg, and R. Vermaas. Finding Software License Violations Through Binary Code Clone Detection. Analysis, (Section 5), 2011.

[6] C. W. Krueger. Software reuse. ACM Comput. Surv., 24(2):131–183, June 1992.

[7] J. Ossher, S. Bajracharya, and C. Lopes. Automated Dependency Resolution for Open Source Software. In Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on, pages 130–140, May 2010.

[8] J. Ossher, H. Sajnani, and C. Lopes. File cloning in open source Java projects: The good, the bad, and the ugly. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 283–292, 2011.

[9] C. K. Roy, J. R. Cordy, and R. Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 74(7):470–495, May 2009.

[10] D. Spinellis and C. Szyperski. How is open source affecting software development? Software, IEEE, 21(1):28–33, 2004.

[11] C. Szyperski. Component Software: Beyond Object-Oriented Programming. Addison-Wesley, 2002.
