An Investigation into the Free/Open Source Software
Phenomenon using Data Mining, Social Network
Theory, and Agent-Based
Greg Madey
Computer Science & Engineering
University of Notre Dame
UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software
University of Illinois, Urbana-Champaign
October 8-9, 2003
This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829
Contributors
• Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
• Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
• Jeff Goett, University of Notre Dame (REU Student)
• Chris Hoffman, University of Notre Dame (REU Student)
• Nadir Kiyanclar, University of Notre Dame (REU Student)
• Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
• Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
• Carlos Siu, University of Notre Dame (REU Student)
• Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
• Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)
Outline
• Research approach
• Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments
• Data collection and analysis
• Example research question
• Simulation
• Computer experiments
• Results
One Approach to Researching F/OSSD
• Online data
– Screen scraping
– Database dumps
• Modeling
– Social network theory
– Evolutionary assumptions
• Simulation
– Verification and validation
– Computer experiments
• Variation of Classical Scientific Method
Classical Scientific Method
1. Observe the world
a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
a) If the experiment “fails”, then the hypothesis is accepted (until replaced)
b) If the experiment “succeeds”, then reject the hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated
The Computer Experiment
Agent-Based Simulation as a Component of the Scientific Method
• Modeling (Hypothesis)
• Agent-Based Simulation (Experiment)
• Observation
Agent-Based Simulation as a Component of the Scientific Method
• Modeling (Hypothesis): Social Network Model of F/OSS
• Agent-Based Simulation (Experiment): Grow Artificial SourceForge
• Observation: Analysis of SourceForge Data
Agent-Based Modeling and Simulation
• Conceptual models of a phenomenon
• Simulations are computer implementations of the conceptual models
• Agents in models and simulations are distinct entities (instantiated objects)
– Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence
– Contrasted with higher level AI “intelligent agents”
• Foundations in complexity theory
– Self-organization
– Emergence
Collaborative Social Networks
• Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
• Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
• Interlocking corporate directorships
• Terrorist networks
• Open-source software developers (Madey et al, AMCIS 2002)
• Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph => a framework to model data/phenomenon
SourceForge
• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 70,000 Projects
• 90,000 Developers
• 700,000 Registered Users
Savannah
• SourceForge Software?
• Free Software Foundation
• 1,600 Projects
• 16,000 Registered Users
Observations
• Web mining
• Web crawler (scripts)
– Python
– Perl
– AWK
– Sed
• Monthly
• Since Jan 2001
• ProjectID
• DeveloperID
• Almost 2 million records
• Relational database
PROJ|DEVELOPER
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
Collaboration Networks
Adapted from Newman, Strogatz and Watts, 2001
15850 dev[46] dev[83]
15850 dev[46] dev[48]
15850 dev[46] dev[56]
15850 dev[46] dev[58]
6882 dev[58] dev[47]
6882 dev[47] dev[79]
6882 dev[47] dev[52]
6882 dev[47] dev[55]
7028 dev[46] dev[99]
7028 dev[46] dev[51]
7028 dev[46] dev[57]
7597 dev[46] dev[45]
7597 dev[46] dev[72]
7597 dev[46] dev[55]
7597 dev[46] dev[58]
7597 dev[46] dev[61]
7597 dev[46] dev[64]
7597 dev[46] dev[67]
7597 dev[46] dev[70]
9859 dev[46] dev[49]
9859 dev[46] dev[53]
9859 dev[46] dev[54]
9859 dev[46] dev[59]
[Figure: network diagram of developer nodes grouped into Project 6882, Project 9859, Project 7597, Project 7028, and Project 15850]
F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links
24 Developers
5 Projects
2 Linchpin Developers
1 Cluster
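The step from PROJ|DEVELOPER records to a developer network can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the function names are hypothetical, and treating a "linchpin" as any developer who belongs to more than one project is an assumption made here for the example.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows in the PROJ|DEVELOPER format shown on the Observations slide.
RECORDS = """\
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
"""

def parse_records(text):
    """Return {project_id: set of developer ids} from PROJ|DEVELOPER lines."""
    members = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        project, developer = line.split("|")
        members[project].add(developer)
    return dict(members)

def developer_network(members):
    """Link two developers whenever they share at least one project."""
    adjacency = defaultdict(set)
    for developers in members.values():
        for dev in developers:
            adjacency[dev]  # touch the key so solo developers appear as isolated nodes
        for a, b in combinations(sorted(developers), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)

def linchpins(members):
    """Assumed definition: developers who belong to more than one project."""
    counts = defaultdict(int)
    for developers in members.values():
        for dev in developers:
            counts[dev] += 1
    return sorted(dev for dev, n in counts.items() if n > 1)

members = parse_records(RECORDS)
network = developer_network(members)
print(linchpins(members))  # ['dev8975']
```

The project network (the "dual" of this one) could be built the same way by swapping the roles of the two columns.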
Topological Analysis of the Data
• Statistics inspected
– Diameter
– Average degree
– Clustering coefficient
– Degree distribution
– Cluster size distribution
– Relative size of major cluster
– Fitness and life cycle
• Evolution of these statistics
• Dual networks
– developer network and project network
Terminology
• Diameter
– Average length of shortest paths between all pairs of vertices
• Degree
– The count of edges connected to a given vertex
• Average degree
– Average of the degrees of all vertices in the network
• Cluster
– The connected components of the network
• Clustering coefficient (CC)
– CC_i: Fraction representing the number of links actually present relative to the total possible number of links among the vertices in the neighborhood of vertex i
– CC: average of all CC_i in a network
• Degree distribution
– The distribution of degrees throughout a network
• Major cluster
– The largest cluster in the network
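The definitions above can be sketched in a few lines of standard-library Python. This is an illustrative implementation under stated assumptions, not the analysis code from the study: vertices with fewer than two neighbors are given CC_i = 0, and the diameter (as defined on the slide) is averaged only over pairs that are actually connected. The adjacency-dict representation and function names are choices made for the example.

```python
from collections import deque

def components(adj):
    """Connected components (the slide's 'clusters') as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), {start}
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    comp.add(nb)
                    queue.append(nb)
        result.append(comp)
    return result

def average_degree(adj):
    """Mean number of edges per vertex."""
    return sum(len(nbs) for nbs in adj.values()) / len(adj)

def local_clustering(adj, node):
    """CC_i: links present among the neighbors of `node` / links possible."""
    nbs = list(adj[node])
    k = len(nbs)
    if k < 2:
        return 0.0  # assumption: undefined cases count as zero
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbs[j] in adj[nbs[i]])
    return links / (k * (k - 1) / 2)

def clustering_coefficient(adj):
    """CC: average of CC_i over all vertices."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over connected pairs (BFS from every vertex)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Toy graph: a triangle a-b-c, a pendant vertex d on c, and an isolated vertex e.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c"}, "e": set()}
major_cluster = max(components(toy), key=len)
```

On the toy graph the major cluster is {a, b, c, d}, the average degree is 1.6, and the average shortest-path length over connected pairs is 4/3.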
Degree Distribution: Developers
Degree Distribution: Projects
Diameter of Developer Network vs. Time
• Network size increased from 30,000 to 70,000
Diameter of Project Network vs. Time
• Network size increased from 20,000 to 50,000.
• Diameter decreasing with time both for developer network and project network
Clustering Coefficient of Developer Network vs. Time
Clustering Coefficient of Project Network vs. Time
Cluster Size Distribution
• R² with major cluster is 0.7426
• R² without major cluster is 0.9799
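One common way to obtain an R² for a power-law fit is an ordinary least-squares line through the log-log points. The slides do not state which fitting procedure was used, so the sketch below is an assumption, and the input numbers are synthetic values chosen only to show how a single very large "major cluster" point can pull R² down.

```python
import math

def loglog_fit(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns slope, intercept, R^2."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic cluster sizes and counts following count = 1000 * size^-2 exactly.
sizes = [1, 2, 4, 8, 16]
counts = [1000 * s ** -2 for s in sizes]
slope, intercept, r2_without = loglog_fit(sizes, counts)

# Add one hypothetical giant cluster (size 10000, occurring once).
_, _, r2_with = loglog_fit(sizes + [10000], counts + [1])
```

Here `r2_without` is essentially 1 and `r2_with` is lower, which mirrors the direction of the difference reported on the slide.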
Relative Size of Major Cluster vs. Time
• Increase of the relative size of the major cluster
• Approaching steady-state?
An Example Research Question
• What processes can explain the evolution of the project and developer social networks?
– Randomly growing network (Erdos-Renyi, 1960)?
– Evolving network with preferential attachment (Barabasi-Albert, 1999)?
– Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
– Others?
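The difference between random growth and preferential attachment can be sketched with a deliberately simplified growth rule in which each newcomer makes exactly one link. This is an illustration only: the uniform variant is a growing stand-in for the random case rather than the classical Erdos-Renyi construction, and the function name and parameters are hypothetical.

```python
import random

def grow_network(n_nodes, preferential, seed=0):
    """Grow a network one node at a time; each newcomer links to one existing node.

    preferential=False: the target is chosen uniformly at random.
    preferential=True: the target is chosen with probability proportional
    to its current degree (preferential attachment).
    Returns the list of node degrees.
    """
    rng = random.Random(seed)
    degree = [1, 1]      # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]   # each edge lists both of its ends, so a uniform draw
                         # from this list is a degree-proportional draw
    for new in range(2, n_nodes):
        if preferential:
            target = rng.choice(endpoints)
        else:
            target = rng.randrange(new)
        degree.append(1)
        degree[target] += 1
        endpoints.extend((new, target))
    return degree

random_degrees = grow_network(5000, preferential=False)
preferential_degrees = grow_network(5000, preferential=True)
```

Comparing `max(random_degrees)` with `max(preferential_degrees)` shows the heavy tail: preferential attachment produces a few very highly connected hubs, while uniform attachment does not.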
Computer Experiments
• Agent-based simulations
• Java programs using Swarm class library
– Validation (docking) exercises using Java/Repast
• Grow artificial SourceForges (Epstein & Axtell, 1996)
– Parameterized with observed data, e.g., developer behaviors
• Join rates
• New project additions
• Leave projects
– Evaluation of multiple models (hypotheses)
– Verification/validation
Cycles of Modeling & Simulation
• Modeling (Hypothesis): Social Network Models ER => BA => BA+Fitness => BA+Dynamic Fitness
• Agent-Based Simulation (Experiment): Grow Artificial SourceForge
• Observation: Analysis of SourceForge Data
– Degree Distribution
– Average Degree
– Diameter
– Clustering Coefficient
– Cluster Size Distribution
Model for SourceForge
• ABM based on bipartite graph
• Model description
– Agent: developer
– Behaviors: Create, join, abandon and idle
– Preference: developer’s and project’s
– Fitness
• Four models in iterations
– ER, BA, BA with constant fitness and BA with dynamic fitness
• Comparison of empirical and simulated data
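A toy version of such a bipartite agent-based model might look like the sketch below. It is written in Python for brevity (the study used Java with Swarm), and everything numeric in it is a placeholder: the arrival rate, the behavior probabilities, and the "team size + 1" join weight are assumptions for illustration, not parameters taken from the SourceForge data.

```python
import random

def simulate(steps, new_devs_per_step=5, probs=(0.1, 0.5, 0.1, 0.3), seed=0):
    """Toy bipartite developer/project simulation.

    probs gives the (create, join, abandon, idle) probabilities per developer
    per step. All values are illustrative placeholders.
    """
    rng = random.Random(seed)
    projects = {0: set()}    # project id -> set of developer ids
    memberships = {}         # developer id -> set of project ids
    next_project = 1
    actions = ("create", "join", "abandon", "idle")
    for _ in range(steps):
        # New developers arrive at the start of each step.
        for _ in range(new_devs_per_step):
            memberships[len(memberships)] = set()
        for dev in list(memberships):
            action = rng.choices(actions, weights=probs)[0]
            if action == "create":
                projects[next_project] = {dev}
                memberships[dev].add(next_project)
                next_project += 1
            elif action == "join":
                # Preferential choice: larger teams are more likely to be joined.
                ids = list(projects)
                weights = [len(projects[p]) + 1 for p in ids]
                chosen = rng.choices(ids, weights=weights)[0]
                projects[chosen].add(dev)
                memberships[dev].add(chosen)
            elif action == "abandon" and memberships[dev]:
                dropped = rng.choice(sorted(memberships[dev]))
                memberships[dev].discard(dropped)
                projects[dropped].discard(dev)
            # "idle" (or an abandon with nothing to leave): no change this step
    return projects, memberships

projects, memberships = simulate(steps=20)
```

The two returned dictionaries are the two sides of the bipartite graph, so the same topological statistics used for the empirical data can be computed on the simulated network and compared.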
ER Model – Degree Distribution
• Degree distribution is a normal distribution, while it is a power law in the empirical data
• Fit Fails!
ER Model - Diameter
• Average degree is decreasing while it is increasing in empirical data
• Diameter is increasing while it is decreasing in empirical data
• Fit Fails!
ER Model – Clustering Coefficient
• Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in empirical data.
• Fit fails!
ER Model – Cluster Size Distribution
• Power law distribution with R² of 0.6667 (0.9653 without the major cluster), while R² in empirical data is 0.7426 (0.9799 without the major cluster)
• The actual distribution is different from empirical data
• Fit Fails!
BA Model – Degree Distribution
• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has an R² of 0.9798 and empirical data has an R² of 0.9714.
• For project distribution: simulated data has an R² of 0.6650 and empirical data has an R² of 0.9838.
• Partial Fit!
BA Model – Diameter and Clustering Coefficient
• Small diameter and high clustering coefficient like empirical data
• Diameter and clustering coefficient are both decreasing like empirical data
• Good Fit!
BA Model with Constant Fitness
• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has an R² of 0.9742 and empirical data has an R² of 0.9714.
• For project distribution: simulated data has an R² of 0.7253 and empirical data has an R² of 0.9838.
• Improved fit!
Discovery: Project Life Cycle
BA Model with Dynamic Fitness
• Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has an R² of 0.9695 and empirical data has an R² of 0.9714.
• For project distribution: simulated data has an R² of 0.8051 and empirical data has an R² of 0.9838.
• Somewhat better fit!
Models of the F/OSS Social Network (Alternative Hypotheses)
• General model features
– Agents are nodes on a graph (developers or projects)
– Behaviors: Create, join, abandon and idle
– Edges are relationships (joint project participation)
– Growth of network: random or types of preferential attachment, formation of clusters
– Fitness
– Network attributes: diameter, average degree, degree distribution, clustering coefficient
• Four specific models
– ER (random graph) - (1960)
– BA (preferential attachment) - (1999)
– BA ( + constant fitness) - (2001)
– BA ( + dynamic fitness) - (2003)
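How fitness changes the attachment rule can be sketched as follows. In the fitness variant of preferential attachment, a node's chance of gaining a link is proportional to its fitness times its degree. The slides do not specify how the dynamic fitness evolves, so the half-life decay in `decayed_fitness` is purely a hypothetical stand-in, loosely motivated by the project life cycle observation.

```python
def attachment_probabilities(degrees, fitness):
    """P(i) = fitness[i] * degrees[i] / sum_j fitness[j] * degrees[j]."""
    weights = [f * k for f, k in zip(fitness, degrees)]
    total = sum(weights)
    return [w / total for w in weights]

def decayed_fitness(base, age, half_life):
    """Hypothetical dynamic fitness: halves every `half_life` time steps."""
    return base * 0.5 ** (age / half_life)

# With equal fitness the rule reduces to plain preferential attachment.
plain = attachment_probabilities([1, 3], [1.0, 1.0])   # [0.25, 0.75]

# An older, larger project can lose out to a younger one once fitness decays.
aged = attachment_probabilities(
    [1, 3],
    [decayed_fitness(1.0, age=0, half_life=10),
     decayed_fitness(1.0, age=30, half_life=10)],
)
```

In the second call the larger project's fitness has dropped to 0.125, so the smaller, newer project becomes the more likely target.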
Summary
• Why Agent-Based Modeling and Simulation?
– Can be used as components of the Scientific Method
– A research approach for studying socio-technical systems
• Case study: F/OSS - Collaboration Social Networks
– SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness.
– Simulations
• Computer experiments that tested conceptual models
• Provided insight into the phenomenon under study and guided data mining of collected observations
Questions
• Validity of approaches
– Social networks
– Simulation
• Value/Utility of approaches
• Applicability to other areas of F/OSS research
– Project sites, e.g., Mozilla.org
– Individual projects, e.g., Linux kernel
Thank you