mining and modeling the open source software communityoss/papers/jin_defense_final.pdf · what is...

72
Ph.D defense, CSE Dept. Univ. of Notre Dame MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITY JIN XU Ph.D Defense

Upload: others

Post on 30-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. ofNotre Dame

MINING AND MODELINGTHE OPEN SOURCE

SOFTWARE COMMUNITY

JIN XUPh.D Defense

Page 2: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PRESENTATION OUTLINE

Research Motivation Data Collection Methodologies Social Network Theory OSS Network Topology OSS Community Structure Simulation and Validation

Page 3: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

RESEARCH MOTIVATION

What is Open Source Software? Computer Software Freely access, modify and redistribute

OSS developers Distributed/decentralized Volunteers Users participate in OSS development

Examples of OSS Linux, Apache, Sendmail, Perl, PHP, R, Python, Java

Page 4: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

THE STATE OF ART

Computer software engineering Success factors (Cronston (2003), et al.)

Development process, work practice (Scacchi (2004), et al.)

Economics Innovation process (Von Hippel (2005), et al)

Psychology Motivations of individuals (Hars (2001), Bitzer (2005), etc.) Motivations of organizations (Rossi (2005), et al)

Organizational science Modes of organization and performance (Dalle (2005), et al)

Page 5: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

LIMITATIONS OF OSS STUDIES

Qualitative studies Collecting data is difficult

Incomplete and inaccurate data Samples collected using questionnaires

Lack of studies on relationships among projectsand developers

Lack of simulation models Missing historical data Future prediction

Page 6: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OBJECTIVES

Collect and mining OSS community data Identify data resources Build mining tools

Study OSS communities Projects Developers Relationships

Build and validate simulation models Developer activities Project life cycles

Page 7: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PRESENTATION OUTLINE

Research Motivation Data Collection Methodologies Social Network Theory OSS Network Topology OSS Community Structure Simulation and Validation

Page 8: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS WEB REPOSITORIES

OSS development − distributed and decentralized Conflict of individual interests and project goals Easy join and leave Synchronization

Web repositories Version control system Bug report and fix system Discussion tools

Two types of OSS web repositories Individual vs. group

Page 9: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DATA SOURCE

SourceForge.net Over 140,000 projects and 1.5 million registered

users Project web servers, trackers, mailing lists, discussion

boards, software management tools Data information

Classification, list of developers, statistics, forums,trackers

User id, name, participated projects Data limitations

Missing data, outliers, anonymous data

Page 10: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

MINING PROCESS

webdocuments

database

crawler

query

cleaned data

rawdata

report analysis

Extraction cleaning

preprocessing

StatisticsMathematicsData mining

summary

Page 11: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

WEB CRAWLER

WEB CRAWLER

URLaccessingParser

Wordextractor

Tableextractor

Link extractor

Linkexsits?

database End

yes

no

yes

no

Page 12: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PRESENTATION OUTLINE

Research Motivation Data Collection Methodologies Social Network Theory OSS Network Topology OSS Community Structure Simulation and Validation

Page 13: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

SOCIAL NETWORK ANALYSIS

Complex and Self-organizing System Large numbers of locally interacting elements

Dynamic Social Network Main components ―developers and projects

Many developers participate on one project A developer participates on more than one project

Study of developer roles and their activities

Page 14: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS COMMUNITY

User Group Passive Users: no direct contribution Active Users: report bugs, suggest features, etc.

Developer Group Peripheral Developers: irregularly contribute Central Developers: regularly contribute Core Developers: extensively contribute, manage

CVS releases and coordinate peripheral developersand central developers.

Project Leaders: guide the vision and direction of theproject.

Page 15: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS DEVELOPMENTCOMMUNITY

Co-developers

Project Leaders

Core Developers

Active Users

Page 16: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS SOCIAL NETWORK

Two Entities: developer & project Project-Developer Network (bipartite

graph) Node: project, developer Edge: developers in a project are connected

to that project

Page 17: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PROJECT-DEVELOPERNETWORK

Page 18: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS SOCIAL NETWORK (cont.)

Developer Network (unipartite graph) Node: developer Edge: developers in a project are connected to each

other Impact of developers

Project Network (unipartite graph) Node: project Edge: two projects are connected if there is a

developer participates on both. Impact of projects

Page 19: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS DEVELOPER NETWORK

Page 20: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DATA COLLECTION &EXTRACTION

Data Source: SourceForge 2003 datadump

Project Leaders & Core Developers Identification stored in data dump

Co-developers & Active Users Forum: ask & answer questions Artifact: bug report, patch submission, feature

request, etc.

Page 21: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

SUBSET OF THE SCHEMA

Page 22: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

ANALYSIS

Page 23: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

MEMBER DISTRIBUTION

40.6% 55.8% 2.7% 0.9% 70

31.7% 60.3% 5.7% 2.1% 193

11.8% 19.8% 20.6% 47.8% 64847

ActiveUsers

Co-developers

CoreDevelopers

ProjectLeaders

ProjectNumber

ProjectSize

88!

279

88

!

>

279>

Page 24: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PRESENTATION OUTLINE

Research Motivation Data Collection Methodologies Social Network Theory OSS Network Topology OSS Community Structure Simulation and Validation

Page 25: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

TOPOLOGICAL PROPERTIES

Degree Distribution The total number of links connected to a node Relative frequency of each degree value Poisson distribution vs.Power law distribution

Diameter The maximum longest shortest-path The average longest shortest path

Cluster A social network consists of connected nodes

Clustering Coefficient The ratio of the number of links to the total possible number of

links among its neighbors

Page 26: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DEGREE DISTRIBUTION

Page 27: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

ANALYSIS (cont.)

21659279833428043826# of project clusters

2020341972nd Largest project cluster

401753079415091737Largest project cluster

0.82970.88670.80780.8406Clustering coefficient

2.72.710.2Inf.Diameter

319981339817~1Z2

32415086~1Z1

1616911395078311858651SizeDCBAProperty

Page 28: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

SUMMARY

Identified and quantified co-developer andactive user roles

Different community distributions for largeand small projects

Small World Phenomenon Small diameter & high clustering coefficient

Scale Free Power law distribution

Effect of Co-developers & Active Users

Page 29: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PRESENTATION OUTLINE

Research Motivation Data Collection Methodologies Social Network Theory OSS Network Topology OSS Community Structure Simulation and Validation

Page 30: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

COMMUNITY STRUCTURE Communities: In some networks, some nodes

are grouped together by a high density of edges,while there are few edges between those groups

Communities: tightly-connected groups (Newman& Girvan, 2004)

Importance: Nodes within a community ― common features Nodes between communities ― hubs

Page 31: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

INFLUENCE IN OSS NETWORKS

Identify projects which might have relatedsubjects, similar programming environment, orcommon developers.

Study projects interactions during their growth. Get information about the communication path

and knowledge flow within or betweencommunities. Such information can help usadjust and improve the robustness ofcommunications in OSS

Page 32: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

HIERARCHICAL CLUSTERING Tree data structure Measured strength on

each pair of vertices Shortest path The inverse Total number of paths

Join a pair with thestrongest strength

Weakness: inaccurateclustering (Newman &Girvan, 2004)

Page 33: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

BETWEENNESS

Edge betweenness

The Girvan & Newman method Top-down, remove edges with the highest

betweenness O(m^2n) time on a network with m edges and n

vertices O(n^3) on a sparse graph

!=ij

ij

BN

eNvC

)()(

Page 34: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

GREEDY ALGORITHM

Modularity Q: the fraction of edges within a communityminus the expected value of the fraction of edges withina community in a random network.

The best community structure is where Q is the largest2( )

is the fraction of edges that connect nodes in group to group

is the fraction of edges that connect to nodes in group versus

the total edges in the network

is th

ii i

i

ii

ij

ij

i

i

Q e a

ka

e

e i j

a i

k

= !

=

"

"

e number of edges connecting to group i

Page 35: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DATA STRUCTURE

A matrix contains . Each row of the matrix isstored as a balanced binary tree for O(log(n))search and insertion and as a max-heap forconstant time to find the largest element.

A max heap contains the largest element ofeach row

An array contains

ijQ!

ia

Page 36: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

GREEDY ALGORITHM The initial values

Select from the max-heap and communities aregrouped

Update values

O(md log(n))

!

!!

=

"#

"$

%&

='

ij

ii

ij

ji

ijij

e

ka

connectedjie

KK

eQ

otherwise 0

, 1

2

ijQ!

ijj

kijk

kjik

jkik

jk

aaa

aaQ

aaQ

QQ

Q

+=

!"

!#

$

%&

%&

&+&

=&

'

'

inot j, toconnected isk group if 2

jnot i, toconnected isk group if 2

ji, toconnected isk group if

Page 37: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PROBLEM SETUP

The largest component of the project network inJan. 2003.

Exclude project 1 which is the SourceForgesoftware project because it links to 10,000 otherprojects

The project network consists of 27,834 nodesand 173,644 edges

Page 38: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

RESULT: THE VALUE OF Q

Page 39: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

ANALYSIS

Max Q = 0.2227 611 communities The largest community consists of 3467 projects Many small communities with size less than 10 The 10 largest groups include 64.8% of the

whole projects The important communication paths are those

connections between communities Common developers are keys to transfer

information between two project groups

Page 40: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS ASSORTATIVE MIXING

Several explanations exists for the communitystructure

Mutual acquaintance: two nodes with a commonneighbor are more likely to link to each other

Homophily: two nodes with the same attributesare more likely to link to each other Measure: assortative mixing

Page 41: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

ASSORTATIVE MIXING

Assortative coefficient

1

is the fraction of edges in a network which connect nodes

of type to nodes of type

is the fraction of edges of each type which is attached to

nodes of type

ii i i

i i

i i

i

ij

j

e a b

r

a b

e

i j

b

i

!

=

!

" "

"

Page 42: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

ASSORTATIVE MIXING FOR OSS

0.1541Programming language

0.0449Intended audience

0.0553Development status0.0893User interface0.1078Operating System0.1009Topic

Assortative mixing coefficientCategory

Page 43: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

SUMMARY

Community structure exists among SourceForgeproject network

Key communication paths are identified amonggroups

Homophily exists in the OSS network Identify factors which are important in OSS

grouping

Page 44: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PRESENTATION OUTLINE

Research Motivation Data Collection Methodologies Social Network Theory OSS Network Topology OSS Community Structure Simulation and Validation

Page 45: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DOCKING

Docking Verify simulation correctness Discover pros & cons of toolkits

Four Models of OSS random graphs preferential attachment preferential attachment with constant fitness preferential attachment with dynamic fitness

Agent-based Simulation Swarm Repast

Page 46: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DEVELOPER ACTIVITIES

New developer Existing developer

create join abandon idle

add a newproject

add a link remove alink

remove aproject

Page 47: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

VALIDATION USING DIAMETERS

Page 48: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

CONCLUSIONS

Addressed the challenge of data collection Classified roles of OSS developers Analyzed network properties of OSS community Identified community structures Explained the formation of community structure Presented simulation models and the docking

process

Page 49: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

FUTURE WORK

Add weights on vertices and edges More realistic representation

Apply more social network property analysis Closeness

Include more OSS websites Linux Perl Apache Savannah

Page 50: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

CONTRIBUTIONS

Presented a general solution for mining data Found features of OSS community Provided useful topological information on the

underlying structure and evolution of OSScommunity

Provided useful information on thecommunication and the diffusion of informationbetween projects

Demonstrated the use of docking for validationof agent-based simulation models

Page 51: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PUBLICATIONS Jin Xu, Scott Christley, Greg Madey, "Application of Social Network Analysis to the Study of Open

Source Software", The Economics of Open Source Software Development, eds. Jürgen Bitzerand Philipp J.H. Schröder, Elsevier Press, 2006.

Jin Xu, Scott Christley, and Greg Madey, "The Open Source Software Community Structure",NAACSOS2005, Notre Dame, IN, June 2005.

Jin Xu, Yongqin Gao, Scott Christley, Greg Madey, "A Topological Analysis of the Open SourceSoftware Development Community", The 38th Hawaii International Conference on SystemsScience (HICSS-38), Hawaii, January 2005.

Jin Xu and Greg Madey, "Exploration of the Open Source Software Community", NAACSOS2004, Pittsburgh, PA, June 2004

Jin Xu, Yongqin Gao, Jeff Goett, and Greg Madey, "A Multi-Model Docking Experiment ofDynamic Social Network Simulations", Agent2003, Chicago, IL, October 2003

J. Xu, Y. Huang, and G. Madey, "A Research Support System Framework for Web DataminingResearch", Workshop on Applications, Products and Services of Web-based Support Systems atthe Joint International Conference on Web Intelligence (2003 IEEE/WIC) and In telligent AgentTechnology, Halifax, Canada, October 2003, pp. 37-41.

Jin Xu, Yongqin Gao and Greg Madey, "A Docking Experiment: Swarm and Repast for SocialNetwork Modeling", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN,April 2003.

Scott Christley, Jin Xu, Yongqin Gao, Greg Madey, "Public Goods Theory of the Open SourceDevelopment Community", Agent2004, October 2004, Chicago, IL.

Renee Tynan, Gregory Madey, Scott Christley, Vincent Freeh and Jin Xu, "Task characteristics,communication, and outcome in open source software development", under submission

Page 52: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

POTENTIAL PUBLICATIONS

Chapter 2 Journal of Knowledge and Information Systems

Chapter 4 Journal of Organizational Computing

Chapter 5 Journal of Artificial Societies and Social Simulation

Page 53: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

ACKNOWLEGEMENT

My advisor − Greg Madey My colleagues − Scott Christley, Yongqin Gao,

Yingping Huang, Jeff Goett, Ryan Kennedy Professor Renee Tynan My committee members − Dr. Bowyer, Dr.

Flynn, Dr. Cosimano, and outside chair − Dr.Seabaugh

NSF, CISE/IIS-Digital Science and Technology

Page 54: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

THANK YOU

Page 55: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PREVIOUS RESEARCH

Roles Classification Nakakoji K., Yamamoto Y., Kishida K., Ye Y.,

"Evolution Patterns of Open-source SoftwareSystems and Communities", Proceedings of TheInternational Workshop on Principles of SoftwareEvolution, Orlando Florida, May 19-20, 2002.

Xu N., “An Exploratory Study of Open SourceSoftware Based on Public Project Archives”, Thesis,the John Molson School of Business, ConcordiaUniversity, Canada, 2003.

Page 56: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

PREVIOUS RESEARCH (cont.) Social Network Properties

Gao Y. Q., Vincent F., Madey G., "Analysis andModeling of the Open Source Software Community",North American Association for Computational Socialand Organizational Science (NAACSOS 2003),Pittsburgh, PA, 2003.

Madey G., Freeh V., and Tynan R., "Modeling theF/OSS Community: A Quantitative Investigation," inFree/Open Source Software Development, ed.,Stephan Koch, Idea Publishing, 2004.

Page 57: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

NETWORK PROPERTIES

Diameter

Clustering coefficient

away links 2 neighbors ofnumber average the

awaylink 1 neighbors ofnumber average the

nodes ofnumber total the

1)/log(

)/log(

2

1

12

1

!

!

!

+=

z

z

N

zz

zNd

developers have which projects offraction therepresents ,

projects join whodevelopers offraction therepresents ,

)32(

))((1

1

32112

2

2121

kppkv

kppk

vvvv

vvc

k

p

k

p

k

n

n

k

d

k

k

d

n

n

!

!

=

=

+"

""+

=

µ

µ

µµ

Page 58: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

Page 59: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DOCKING

Three ways of Validation Comparison with real phenomenon Comparison with mathematical models Docking with other simulations

Docking Verify simulation correctness Discover pros & cons of toolkits

Page 60: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

SOCIAL NETWORK MODEL

Graph Representation Node/vertex – Social Agent Edge/link – Relationship Index/degree - The number of edges connected to a

node ER (random) Graph

edges attached in a random process No power law distribution

Page 61: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

SOCIAL NETWORKMODEL(Cont.)

Watts-Strogatz (WS) model include some random reattachment No power law distribution

Barabasi-Albert (BA) model with preferentialattachment Addition of preferential attachment Power law distribution

BA model with constant fitness addition of random fitness

BA model with dynamic fitness

Page 62: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS NETWORK A classic example of a dynamic social network Two Entities: developer, project Graph Representation

Node – developers Edge – two developers are participating in the same

project Activities

Create projects Join projects Abandon projects Continue with current projects

Page 63: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

OSS MODEL

Agent: developer Each time interval:

Certain number developers generated New developers: create or join Old developers: create, join, abandon, idle Update preference for preferential models

Page 64: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

SWARM SIMULATION

ModelSwarm Creats developers Controls the activities of developers in the model Generate a schedule

ObserverSwarm Collects information and draws graphs

main Developer (agent)

Properties: ID, degree, participated projects Methods: daily actions

Page 65: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

REPAST SIMULATON

Model creates and controls the activities of developers collects information and draws graphs

Network display Movie snapshot

Developer (agent) Project Edge

Page 66: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DOCKING PROCEDURE

Process: comparisons of parameterscorresponding models.

Findings: Different Random Generators Databases creation errors in the original version Different starting time of schedulers

Page 67: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DOCKING PARAMETERS

Diameter Average length of shortest paths between all pairs of

vertices Degree distribution

The distribution of degrees throughout a network Clustering coefficient (CC)

CCi: Fraction representing the number of linksactually present relative to the total possible numberof links among the vertices in its neighborhood.

CC: average of all CCi in a network Community size

Page 68: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DEGREE DISTRIBUTION

Page 69: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

CLUSTERING COEFFICIENT

Page 70: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

COMMUNITY SIZEDEVELOPMENT

Page 71: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

CONCLUSION

Same results for bothsimulations

Better performance ofRepast

Better displayprovided by Repast Network display

Random Layout

Circular Layout

Page 72: MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITYoss/Papers/Jin_defense_final.pdf · What is Open Source Software? Computer Software Freely access, modify and redistribute

Ph.D defense, CSE Dept. Univ. of Notre Dame

DOCKING PROCESS

OSS models parameters

DockingRepast

Swarm

ER

BA

BAC

BAD

Diameter

Degree distribution

Clustering coefficient

Community size

toolkits