An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

Greg Madey
Computer Science & Engineering
University of Notre Dame

UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software
University of Illinois, Urbana-Champaign
October 8-9, 2003

This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829



Contributors

• Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator)
• Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student)
• Jeff Goett, University of Notre Dame (REU Student)
• Chris Hoffman, University of Notre Dame (REU Student)
• Nadir Kiyanclar, University of Notre Dame (REU Student)
• Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator)
• Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator)
• Carlos Siu, University of Notre Dame (REU Student)
• Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator)
• Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)

Outline

• Research approach
• Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments
• Data collection and analysis
• Example research question
• Simulation
• Computer experiments
• Results

One Approach to Researching F/OSSD

• Online data
  – Screen scraping
  – Database dumps
• Modeling
  – Social network theory
  – Evolutionary assumptions
• Simulation
  – Verification and validation
  – Computer experiments
• Variation of Classical Scientific Method

Classical Scientific Method

1. Observe the world
   a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
   a) If the experiment “fails”, then the hypothesis is accepted (until replaced)
   b) If the experiment “succeeds”, then reject hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated

The Computer Experiment

Agent-Based Simulation as a Component of the Scientific Method

• Modeling (Hypothesis): Social Network Model of F/OSS
• Agent-Based Simulation (Experiment): Grow Artificial SourceForge
• Observation: Analysis of SourceForge Data

Agent-Based Modeling and Simulation

• Conceptual models of a phenomenon
• Simulations are computer implementations of the conceptual models
• Agents in models and simulations are distinct entities (instantiated objects)
  – Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence
  – Contrasted with higher level AI “intelligent agents”
• Foundations in complexity theory
  – Self-organization
  – Emergence

Collaborative Social Networks

• Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001)
• Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003)
• Interlocking corporate directorships
• Terrorist networks
• Open-source software developers (Madey et al., AMCIS 2002)
• Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph => a framework to model data/phenomenon

SourceForge

• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 70,000 Projects
• 90,000 Developers
• 700,000 Registered Users

Savannah

• SourceForge Software?
• Free Software Foundation
• 1,600 Projects
• 16,000 Registered Users

Observations

• Web mining
• Web crawler (scripts)
  – Python
  – Perl
  – AWK
  – Sed
• Monthly
• Since Jan 2001
• ProjectID
• DeveloperID
• Almost 2 million records
• Relational database

PROJ|DEVELOPER
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
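The PROJ|DEVELOPER records on this slide can be turned into a developer collaboration network by linking any two developers who share a project. The following is a minimal sketch, not the authors' crawler or database code; the one-pair-per-line text layout and the function names `load_memberships` and `developer_network` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows copied from the slide. The text layout (one "PROJ|DEVELOPER"
# pair per line) is an assumption about how a dump might be exported.
RECORDS = """\
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
"""


def load_memberships(text):
    """Return {project_id: set of developer ids} from PROJ|DEVELOPER lines."""
    members = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        project, developer = line.strip().split("|")
        members[project].add(developer)
    return dict(members)


def developer_network(members):
    """Link two developers whenever they share at least one project."""
    adjacency = defaultdict(set)
    for developers in members.values():
        for dev in developers:
            adjacency[dev]  # touch the key so solo developers still appear as nodes
        for a, b in combinations(sorted(developers), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


members = load_memberships(RECORDS)
network = developer_network(members)
edge_count = sum(len(neighbors) for neighbors in network.values()) // 2
print(len(members), "projects,", len(network), "developers,", edge_count, "links")
```

Swapping the roles of the two columns in the same code would give the project network, where two projects are linked when they share a developer.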

Collaboration Networks

Adapted from Newman, Strogatz and Watts, 2001

[Figure: project membership records and the resulting network diagram for Projects 6882, 7028, 7597, 9859 and 15850; nodes are labeled by developer, e.g., dev[46], dev[58].]

F/OSS Developers - Collaboration Social Network
Developers are nodes / Projects are links

• 24 Developers
• 5 Projects
• 2 Linchpin Developers
• 1 Cluster

Topological Analysis of the Data

• Statistics inspected
  – Diameter
  – Average degree
  – Clustering coefficient
  – Degree distribution
  – Cluster size distribution
  – Relative size of major cluster
  – Fitness and life cycle
• Evolution of these statistics
• Dual networks
  – Developer network and project network

Terminology

• Diameter
  – Average length of shortest paths between all pairs of vertices
• Degree
  – The count of edges connected to a given vertex
• Average degree
  – Average of the degrees of all vertices in the network
• Cluster
  – A connected component of the network
• Clustering coefficient (CC)
  – CC_i: the number of links actually present among the neighbors of vertex i, as a fraction of the total possible number of links among them
  – CC: average of all CC_i in a network
• Degree distribution
  – The distribution of degrees throughout a network
• Major cluster
  – The largest cluster in the network
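The definitions on this slide translate directly into a few short functions over an adjacency map. This is a plain-Python sketch for small graphs only; the function names and the toy network are made up for illustration, and "diameter" follows the slide's definition (mean shortest-path length), computed here over connected pairs.

```python
from collections import deque


def clusters(adj):
    """Connected components of the network, largest first."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, part = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    part.add(nb)
                    queue.append(nb)
        parts.append(part)
    return sorted(parts, key=len, reverse=True)


def average_degree(adj):
    """Mean number of edges per vertex."""
    return sum(len(nbs) for nbs in adj.values()) / len(adj)


def local_cc(adj, node):
    """CC_i: links present among the neighbors of node / links possible."""
    nbs = list(adj[node])
    k = len(nbs)
    if k < 2:
        return 0.0
    present = sum(
        1 for i in range(k) for j in range(i + 1, k) if nbs[j] in adj[nbs[i]]
    )
    return present / (k * (k - 1) / 2)


def clustering_coefficient(adj):
    """CC: average of CC_i over all vertices."""
    return sum(local_cc(adj, n) for n in adj) / len(adj)


def mean_shortest_path(adj):
    """Average shortest-path length over connected pairs (the slide's 'diameter')."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from each source
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy network: a triangle a-b-c with a tail c-d, plus a separate pair e-f.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
}
major = clusters(toy)[0]
relative_major_size = len(major) / len(toy)
```

The all-pairs breadth-first search is quadratic in the number of vertices, so a network the size of SourceForge would likely need sampling or a dedicated graph library.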

Degree Distribution: Developers

Degree Distribution: Projects

Diameter of Developer Network vs. Time

• Network size increased from 30,000 to 70,000

Diameter of Project Network vs. Time

• Network size increased from 20,000 to 50,000.
• Diameter decreasing with time both for developer network and project network

Clustering Coefficient of Developer Network vs. Time

Clustering Coefficient of Project Network vs. Time

Cluster Size Distribution

• R² with major cluster is 0.7426
• R² without major cluster is 0.9799
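The slide does not say how the R² values were obtained. One common approach, assumed here, is a least-squares line through the log-log plot of the distribution, where a power law appears as a straight line. The sketch below uses a made-up histogram (not SourceForge data) to show how a single giant cluster can pull R² down.

```python
import math


def loglog_fit(sizes, counts):
    """Least-squares line through (log10 size, log10 count).

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up cluster-size histogram: counts fall off roughly as size^-2 ...
sizes = [1, 2, 4, 8, 16, 32]
counts = [10000, 2500, 625, 156, 39, 10]
# ... plus one hypothetical giant "major cluster" far off that line.
with_major = loglog_fit(sizes + [30000], counts + [1])
without_major = loglog_fit(sizes, counts)
print("R^2 with major cluster:    %.4f" % with_major[2])
print("R^2 without major cluster: %.4f" % without_major[2])
```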

Relative Size of Major Cluster vs. Time

• Increase of the relative size of the major cluster
• Approaching steady-state?

An Example Research Question

• What processes can explain the evolution of the project and developer social networks?
  – Randomly growing network (Erdos-Renyi, 1960)?
  – Evolving network with preferential attachment (Barabasi-Albert, 1999)?
  – Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)?
  – Others?
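The difference between the first two candidate processes can be seen in a toy growth loop. In this sketch each newcomer links to `m` existing nodes, chosen either uniformly at random or in proportion to their current degree (preferential attachment). It is a simplified single-mode illustration with hypothetical parameters, not the bipartite developer/project model used in the study.

```python
import random


def grow_network(n_nodes, m, preferential, seed=0):
    """Grow a graph one node at a time; each newcomer links to m existing nodes.

    preferential=False: targets are picked uniformly at random.
    preferential=True:  targets are picked with probability proportional to degree.
    Returns {node: degree}.
    """
    rng = random.Random(seed)
    # Start from a small complete graph on m + 1 nodes (each has degree m).
    degree = {i: m for i in range(m + 1)}
    # Every node appears in `stubs` once per edge end, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            if preferential:
                targets.add(rng.choice(stubs))
            else:
                targets.add(rng.randrange(new))
        degree[new] = m
        for target in targets:
            degree[target] += 1
            stubs.extend([target, new])
    return degree


deg_random = grow_network(2000, 2, preferential=False)
deg_pref = grow_network(2000, 2, preferential=True)
print("max degree, random growth:          ", max(deg_random.values()))
print("max degree, preferential attachment:", max(deg_pref.values()))
```

Preferential attachment tends to produce a few very high-degree hubs, which is the heavy-tailed signature the degree-distribution comparisons in the later slides look for.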

Computer Experiments

• Agent-based simulations
• Java programs using Swarm class library
  – Validation (docking) exercises using Java/Repast
• Grow artificial SourceForge’s (Epstein & Axtell, 1996)
  – Parameterized with observed data, e.g., developer behaviors
    • Join rates
    • New project additions
    • Leave projects
  – Evaluation of multiple models (hypotheses)
  – Verification/validation

Cycles of Modeling & Simulation

• Modeling (Hypothesis): Social Network Models, ER => BA => BA+Fitness => BA+Dynamic Fitness
• Agent-Based Simulation (Experiment): Grow Artificial SourceForge
• Observation: Analysis of SourceForge Data
  – Degree Distribution
  – Average Degree
  – Diameter
  – Clustering Coefficient
  – Cluster Size Distribution

Model for SourceForge

• ABM based on bipartite graph
• Model description
  – Agent: developer
  – Behaviors: Create, join, abandon and idle
  – Preference: developer’s and project’s
  – Fitness
• Four models in iterations
  – ER, BA, BA with constant fitness and BA with dynamic fitness
• Comparison of empirical and simulated data
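The model description above can be sketched as a small simulation loop over a bipartite developer/project structure. This is a toy illustration in Python, not the Swarm/Repast Java code used in the study: the behavior probabilities, the arrival of one developer per step, and the "team size + 1" join weight are all placeholder assumptions.

```python
import random


def simulate(steps, p_create=0.1, p_join=0.5, p_abandon=0.1,
             preferential=True, seed=1):
    """Toy bipartite developer/project simulation.

    Each step one new developer arrives, then every developer draws one
    behavior: create a project, join one, abandon one, or idle (the rest).
    All probabilities are illustrative placeholders, not fitted values.
    """
    rng = random.Random(seed)
    projects = {0: set()}   # project id -> set of developer ids
    memberships = {}        # developer id -> set of project ids
    for step in range(steps):
        memberships[step] = set()   # newcomer; id is the arrival step
        for dev in list(memberships):
            roll = rng.random()
            if roll < p_create:
                pid = len(projects)   # projects are never deleted, so this id is new
                projects[pid] = {dev}
                memberships[dev].add(pid)
            elif roll < p_create + p_join:
                ids = list(projects)
                # Preferential join: weight by current team size + 1;
                # otherwise every project is equally likely.
                weights = [len(projects[p]) + 1 if preferential else 1 for p in ids]
                pid = rng.choices(ids, weights=weights)[0]
                projects[pid].add(dev)
                memberships[dev].add(pid)
            elif roll < p_create + p_join + p_abandon and memberships[dev]:
                pid = rng.choice(sorted(memberships[dev]))
                projects[pid].discard(dev)
                memberships[dev].discard(pid)
            # otherwise: idle this step
    return projects, memberships


projects, memberships = simulate(100)
team_sizes = sorted((len(devs) for devs in projects.values()), reverse=True)
print("projects:", len(projects), "largest teams:", team_sizes[:5])
```

The simulated `projects` and `memberships` maps can then be measured with the same statistics as the empirical data (degree distribution, diameter, clustering coefficient, cluster sizes) to judge the fit.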

ER Model – Degree Distribution

• Degree distribution is a normal distribution, while it is a power law in the empirical data
• Fit fails!

ER Model - Diameter

• Average degree is decreasing, while it is increasing in the empirical data
• Diameter is increasing, while it is decreasing in the empirical data
• Fit fails!

ER Model – Clustering Coefficient

• Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in the empirical data.
• Fit fails!

ER Model – Cluster Size Distribution

• Power law distribution with R² = 0.6667 (0.9653 without the major cluster), while R² in the empirical data is 0.7426 (0.9799 without the major cluster)
• The actual distribution is different from the empirical data
• Fit fails!

BA Model – Degree Distribution

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² = 0.9798 and empirical data has R² = 0.9714.
• For project distribution: simulated data has R² = 0.6650 and empirical data has R² = 0.9838.
• Partial fit!

BA Model – Diameter and Clustering Coefficient

• Small diameter and high clustering coefficient like empirical data
• Diameter and clustering coefficient are both decreasing like empirical data
• Good fit!

BA Model with Constant Fitness

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² = 0.9742 and empirical data has R² = 0.9714.
• For project distribution: simulated data has R² = 0.7253 and empirical data has R² = 0.9838.
• Improved fit!

Discovery: Project Life Cycle

BA Model with Dynamic Fitness

• Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² = 0.9695 and empirical data has R² = 0.9714.
• For project distribution: simulated data has R² = 0.8051 and empirical data has R² = 0.9838.
• Somewhat better fit!

Models of the F/OSS Social Network (Alternative Hypotheses)

• General model features
  – Agents are nodes on a graph (developers or projects)
  – Behaviors: Create, join, abandon and idle
  – Edges are relationships (joint project participation)
  – Growth of network: random or types of preferential attachment, formation of clusters
  – Fitness
  – Network attributes: diameter, average degree, degree distribution, clustering coefficient
• Four specific models
  – ER (random graph) - (1960)
  – BA (preferential attachment) - (1999)
  – BA ( + constant fitness) - (2001)
  – BA ( + dynamic fitness) - (2003)
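One way to read the four models is as four ways of weighting which project a developer joins. The sketch below is an interpretation, not the authors' formulation: the slides do not define "dynamic fitness", so the age-based decay with a `half_life` parameter is purely an assumption (suggested by the project life cycle slide), and the "degree + 1" base weight is likewise hypothetical.

```python
import random


def attachment_weights(degrees, fitness=None, ages=None, half_life=None):
    """Selection weight per project under the different model variants.

    degrees only                 -> preferential attachment (BA-like)
    degrees + fitness            -> BA with constant fitness
    degrees + fitness + ages     -> BA with fitness that decays with project
                                    age (an assumed form of "dynamic" fitness)
    """
    weights = []
    for i, k in enumerate(degrees):
        w = k + 1                      # +1 so empty projects can still be chosen
        if fitness is not None:
            w *= fitness[i]
        if ages is not None and half_life is not None:
            w *= 0.5 ** (ages[i] / half_life)   # halve the appeal every half_life
        weights.append(w)
    return weights


degrees = [9, 1, 4]                    # made-up team sizes for three projects
plain = attachment_weights(degrees)
constant = attachment_weights(degrees, fitness=[1, 5, 1])
dynamic = attachment_weights(degrees, fitness=[1, 5, 1],
                             ages=[10, 0, 0], half_life=5)

rng = random.Random(0)
chosen = rng.choices(range(len(degrees)), weights=dynamic)[0]
```

Uniform weights (every entry equal) would correspond to the random-growth baseline that the ER slides compare against.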

Summary

• Why Agent-Based Modeling and Simulation?
  – Can be used as components of the Scientific Method
  – A research approach for studying socio-technical systems
• Case study: F/OSS - Collaboration Social Networks
  – SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness.
  – Simulations
    • Computer experiments that tested conceptual models
    • Provided insight into the phenomenon under study and guided data mining of collected observations

Questions

• Validity of approaches
  – Social networks
  – Simulation
• Value/Utility of approaches
• Applicability to other areas of F/OSS research
  – Project sites, e.g., Mozilla.org
  – Individual projects, e.g., Linux kernel

Thank you