Mining Open Source Software (OSS) Data Using Association Rules


'Data from the Field' EII Workshop, 24 May 2007


Mining SourceForge Data to Discover Models of Open Source Software (OSS) Project Performance

Joseph Davis, Bavani Arunasalam, Simon Poon, Sanjay Chawla

Knowledge Management Research Group

School of Information Technologies

The University of Sydney


Outline

• Motivation for this project

• Open Source Software (OSS) Development

• SourceForge data repository

• Data Mining Possibilities

• Association Rule Mining and Association Rules Network (ARN)

• Application of ARNs to OSS data

• Theory Building using Data Mining

• Conclusions and Future Research


Motivation

• Steady success of Open Source Software (OSS): Linux, Apache, Samba, Python, MySQL

• The KM group is studying a range of OSS-related questions using theoretical and data-driven approaches

• Availability of extensive data on most aspects of OSS projects

• Question: What are the key factors that can explain ‘success’ in OSS projects?


Open Source Software Development

• A non-proprietary model of software development that is perceived to be socially beneficial

• OS software in the public domain; source code freely available for modification and distribution

• Nearly 200,000 projects in progress, each involving dozens to hundreds of (geographically distributed) developers who coordinate their work through the internet

• Increasingly viewed as a viable model for building robust, secure, and scalable software - commons-based peer production model/distributed innovation.


OSS Trends

• Growing acceptance of OS software in organizations

• Increasing participation by large software companies such as IBM, Sun, HP etc. in OSS development

• Increasingly viable software distribution business models

• Large and growing communities of OSS developers and users


Untested Claims regarding OSS development

• Good software evolves when a dedicated community (of developers and programmers) works cooperatively, in contrast to the more traditional hierarchical and closed model (OSI, 2001): the ‘Cathedral’ and the ‘Bazaar’ analogy.

• Quality, speed, portability, and scalability of the resulting software.

• Taming complexity, fewer bugs (many eyeballs phenomenon)

• Offers a viable model for the emerging ‘virtual organisations’.


Open Research Questions

• How do we discover crucial relationships that characterise successful and unsuccessful OSS projects?

• How can we develop models (specifying hypotheses) of the critical determinants of OSS project performance?

• What constitutes good performance in OSS development?


Field Data for OSS Research

• SourceForge.net is the largest OSS development website.

• Besides hosting, SourceForge.net provides services for version control, bug-tracking etc.

• Nearly 200,000 projects grouped under 17 categories; over 2 million users.

• Great source of ‘field’ data to research OSS development.


Problems with SourceForge

• The number of ongoing OSS projects is misleading. Most of the overall activity is accounted for by fewer than 10% of the projects (Pareto distribution)

• Need for purposeful sampling and careful data cleaning: extreme variations across projects, and noise


Problem Definition

• GIVEN: OSS Data downloaded from SourceForge.net

• OBJECTIVE: Find patterns which characterize a high performing OSS project

• CONSTRAINTS: The surrogate variable for performance is the number of downloads.


Why not statistical models?

• Attributes were of heterogeneous types: numerical and discrete

• Data plagued with missing values

• Downloads followed a Pareto distribution

– Most projects have few downloads, but there is a long tail

– Ex: the median number of downloads is 70, but the count can be up to 600,000


Association Rules

• Association rule mining:

– Finding frequent patterns, associations, correlations among sets of items or objects in transaction databases, relational databases, and other information repositories.

• Applications:

– Market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

• Examples:

– Rule form: “Body -> Head [support, confidence]”.

– buys(x, “diapers”) -> buys(x, “beers”) [1%, 60%]

– major(x, “CS”) ^ takes(x, “DB”) -> grade(x, “high”) [1%, 75%]
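The two bracketed numbers attached to a rule can be computed directly from a list of transactions. The following is a minimal sketch under the usual definitions (support is the fraction of transactions containing both body and head; confidence is that count divided by the number of transactions containing the body). The function name `support_and_confidence` and the five toy baskets are made up for illustration and are not taken from the slides.

```python
def support_and_confidence(transactions, body, head):
    """Return (support, confidence) of the rule body -> head."""
    body, head = set(body), set(head)
    # Transactions that contain every item of the body.
    n_body = sum(1 for t in transactions if body <= t)
    # Transactions that contain every item of the body and of the head.
    n_both = sum(1 for t in transactions if (body | head) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Toy market-basket data (illustrative only).
baskets = [
    {"diapers", "beers", "milk"},
    {"diapers", "beers"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beers"},
]

support, confidence = support_and_confidence(baskets, {"diapers"}, {"beers"})
print(f"diapers -> beers [{support:.0%}, {confidence:.0%}]")
```

Here two of the five baskets contain both items, and two of the three baskets with diapers also contain beers, so the rule has 40% support and roughly 67% confidence.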


Typical Association Rule Mining Approaches

• Discover robust association rules that are non-obvious and actionable

• Discover frequent item sets as features that serve as discriminators for classification and prediction (based on a class variable)

• Our approach seeks to discover a graph structure that characterises performance based on the mined association rules.


Association Rules

• Given: (1) database of transactions (OSS projects), (2) each transaction is a list of items (project variable values)

• Find: all rules that correlate the presence of one set of items with that of another set of items

– E.g., 72% of OSS projects whose bug fixing activity level is high and whose number of developers is high also have a high number of downloads: (bug fixing activity = ‘high’) ^ (number of developers = ‘high’) -> (number of downloads = ‘high’)
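To make the "projects as transactions" framing concrete, here is a small sketch of how numeric project variables might be turned into items such as `downloads=high`. The slides do not say how the high/low levels were defined, so the median split used here is an assumption, and the four project records and field names are hypothetical.

```python
from statistics import median

# Hypothetical project records; the values are made up for illustration.
projects = [
    {"developers": 12, "bugs_fixed": 340, "downloads": 50000},
    {"developers": 2, "bugs_fixed": 5, "downloads": 40},
    {"developers": 8, "bugs_fixed": 120, "downloads": 9000},
    {"developers": 1, "bugs_fixed": 0, "downloads": 70},
]


def to_transactions(records):
    """Turn each record into a set of 'variable=level' items.

    Assumption: a value counts as 'high' when it exceeds the median
    of that variable across all records, and as 'low' otherwise.
    """
    cutoffs = {k: median(r[k] for r in records) for k in records[0]}
    return [
        {f"{k}={'high' if r[k] > cutoffs[k] else 'low'}" for k in r}
        for r in records
    ]


transactions = to_transactions(projects)
for items in transactions:
    print(sorted(items))
```

Each resulting set plays the role of one transaction, so a rule such as `developers=high -> downloads=high` can be evaluated by counting the sets that contain those items.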


Problems with Association Rule Mining

• Too many (irrelevant/redundant) rules generated

• Measures of “interestingness” still primitive and not general

• Our solution: a pruning strategy that creates an Association Rules Network in a recursive manner

Related Work: S. Chawla, J. Davis, G. Pandey, "On Local Pruning of Association Rules Using Directed Hypergraphs", IEEE Conference on Data Engineering (ICDE’04)


Association Rules Network

• Consider a binary table R(A,B,C,D,E,F,G)

• {B=1, C=1} -> {A=1}

• {D=1} -> {A=1}

• {F=0} -> {B=1}

• {F=0, E=1} -> {C=1}

• {E=1, G=0} -> {D=1}

• {A=1, G=1} -> {E=1}

Fix a consequent {A=1}.

[Figure: the resulting network for the consequent A=1, with nodes B=1, C=1, F=0, D=1, E=1, G=0 and A=1]
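One way to read this example is as a backward, level-by-level expansion from the fixed consequent. The sketch below is a simplified illustration, not the authors' published algorithm: `build_arn` and its cycle guard (skip a rule if any antecedent item already sits at the same level as the rule's consequent or closer to the sink) are assumptions made for this example.

```python
from collections import deque

# The six rules listed above: (antecedent items, single consequent item).
RULES = [
    (frozenset({"B=1", "C=1"}), "A=1"),
    (frozenset({"D=1"}), "A=1"),
    (frozenset({"F=0"}), "B=1"),
    (frozenset({"F=0", "E=1"}), "C=1"),
    (frozenset({"E=1", "G=0"}), "D=1"),
    (frozenset({"A=1", "G=1"}), "E=1"),
]


def build_arn(rules, sink):
    """Expand backwards from the sink, one level of rules at a time."""
    level = {sink: 0}          # distance (in rules) of each item from the sink
    hyperedges = []            # rules admitted into the network
    queue = deque([sink])
    while queue:
        target = queue.popleft()
        for antecedent, consequent in rules:
            if consequent != target:
                continue
            # Assumed cycle guard: unseen items would land one level
            # further out; known items must already be further out.
            if any(level.get(item, level[target] + 1) <= level[target]
                   for item in antecedent):
                continue
            hyperedges.append((antecedent, consequent))
            for item in sorted(antecedent):
                if item not in level:
                    level[item] = level[target] + 1
                    queue.append(item)
    return level, hyperedges


level, hyperedges = build_arn(RULES, "A=1")
for antecedent, consequent in hyperedges:
    print(sorted(antecedent), "->", consequent)
```

With the sink fixed at A=1, five rules are admitted. The rule {A=1, G=1} -> {E=1} is skipped because its antecedent contains the sink, so G=1 never enters the network.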


ARN Definition

An ARN (R, z) is a weighted directed hypergraph G = (V ∪ {z}, E), where z is a distinguished sink item (node) and R is the set of association rules, such that

• Each hyperedge in E corresponds to a rule in R whose consequent is a singleton,

• There is a hyperedge which corresponds to a rule r whose consequent is the single item z.


ARN Definition (cont.)

• The distinguished vertex z is reachable from any other vertex in G.

• Any vertex p not equal to z is not reachable from z.

• The weights of the edges correspond to the confidence of the rules that they encapsulate.
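The structural conditions in the definition can be checked mechanically. The sketch below is an illustration under a simplifying assumption: each hyperedge is treated as ordinary edges from every antecedent item to the consequent, which is weaker than a full hypergraph notion of reachability. The function names `reaches` and `is_arn` are hypothetical, edge weights are left out, and the rules reuse the binary-table example R(A,...,G).

```python
def reaches(edges, start, goal):
    """Plain graph reachability along item -> consequent links."""
    successors = {}
    for antecedent, consequent in edges:
        for item in antecedent:
            successors.setdefault(item, set()).add(consequent)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def is_arn(edges, sink):
    """Check the structural conditions of the definition (weights ignored)."""
    vertices = {sink}
    for antecedent, consequent in edges:
        vertices |= set(antecedent) | {consequent}
    # Some rule must have the sink as its consequent.
    has_sink_rule = any(consequent == sink for _, consequent in edges)
    # The sink must be reachable from every vertex.
    all_reach_sink = all(reaches(edges, v, sink) for v in vertices)
    # Nothing is reachable from the sink: it has no outgoing edges.
    sink_is_terminal = not any(sink in antecedent for antecedent, _ in edges)
    return has_sink_rule and all_reach_sink and sink_is_terminal


EDGES = [
    ({"B=1", "C=1"}, "A=1"),
    ({"D=1"}, "A=1"),
    ({"F=0"}, "B=1"),
    ({"F=0", "E=1"}, "C=1"),
    ({"E=1", "G=0"}, "D=1"),
]

print(is_arn(EDGES, "A=1"))
print(is_arn(EDGES + [({"A=1", "G=1"}, "E=1")], "A=1"))
```

The first call accepts the five-rule network. The second rejects it once a rule with the sink A=1 in its antecedent is added, since other vertices would then be reachable from z.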


Sampling

• Results based on a sample of 2301 ‘stable’ or ‘production’ projects which were initiated in the second half of 1999.


ARN for High Download

[Figure: ARN with sink node #Download = High. Other nodes: #Support Request = High, #Patches Completed = High, #Bugs Found = High, #Forum Messages = High, #Developers = High, OS = POSIX, #CVS Committed = High, #Bugs Fixed = High, #Public Forums = High, #Administrators = High. Edge confidences shown: 78.7%, 73.8%, 68.4%, 67.9%, 90%, 55.3%, 93.3%]


ARN for Low Download

[Figure: ARN with sink node #Download = Low. Other nodes: #Support Completed = Low, #Public Forum = Low, #Bugs Found = Low, #Forum Messages = Low, #Developers = Low, #OS = 1, #CVS Committed = Low, #Bugs Fixed = Low, #Mailing Lists = Low, #Administrators = Low, #Support Requested = Low, #Task Completed = Low, Environment = Web based, #Patches = Low, #Surveys = Low, #Environments = 1. Edge confidences shown: 95.3%, 77.9%, 92.1%, 60.1%]


Resulting Network

[Figure: resulting network with nodes #Download, #Bugs Found, #Forum Messages, #Developers, #Public Forum = Low, #CVS Committed, #Bugs Fixed, #Administrators]


Critical Factors

• Coding and bug fixing activity levels

• Communication intensity

• Core development team strength


Validation with Factor Analysis (FA)

• Independently applied FA.

• Factors are mutually orthogonal variables which are linear combinations of subsets of original variables.

• The factor structures are generally consistent with the ARN results.


Related Research Projects

• Temporal analysis of OSS project evolution

• Studies of OSS communities

• Analysis of OS software code and community co-evolution (Samba)

• Study of open source software deployment in organisations.


Conclusion

• Need to understand the key drivers for OSS beyond experience-based intuition and isolated case studies

• Association Rules Networks (ARNs) give some insight into the process

• These insights are consistent with results from Software Engineering

• Factor Analysis as a form of validation