Analysis of the Open Source Software Development Community Using ST Mining: A Research Plan
Yongqin Gao, Greg Madey
Computer Science & Engineering, University of Notre Dame
NAACSOS Conference, Notre Dame, IN, June 26-28, 2005
Supported in part by the National Science
Foundation – ISS/Digital Science & Technology
Outline
• Background
• Motivation
• Problem definition
• Research data
• Methodology
• Conclusion
Background (OSS)
• What is OSS?
  • Free to use, modify and distribute
  • Source code available and modifiable
• Potential advantages over commercial software
  • Transparent and easy adoption
  • Fast development
  • Low cost
  • Potential high quality
• Why study OSS?
  • Software engineering: new development and coordination methods
  • Open content: model for other forms of open, shared collaboration
  • Complexity: successful example of self-organization/emergence
  • Growing popularity
  • Non-traditional governance and project management practices
  • Virtual --> Data!
Open Source Software (OSS)
• Free …
  • to view source
  • to modify
  • to share
  • of cost
• Examples
  • Apache
  • Perl
  • GNU
  • Linux
  • Sendmail
  • Python
  • KDE
  • GNOME
  • Mozilla
  • Thousands more
[Logos: Linux, GNU, Savannah]
Leaders
• Linus Torvalds: Linux
• Larry Wall: Perl
• Richard Stallman: GNU Manifesto
• Eric Raymond: The Cathedral and the Bazaar
Success of Apache
• Almost 70% Market Share (Netcraft.com)
Research Approach
• Combined Data Mining: Understanding the Social and Task Dynamics that Predict Developer Behaviors
• Social Network Analysis: Longitudinal Study of Preferential Attachment and Dynamic Attachment
• Conceptual Explanatory Model of OSS: Agent-Based Modeling and Simulation
• Diagram link labels: Parameter Values, Structural Features, Cross Validation
• Opportunity: Huge amounts of relatively good data
SourceForge.net
• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 100 K Projects
• 100 K Developers
• 1 M Registered Users
150 GBytes of Data & Growing
Project   Developer   Developer
15850     dev[46]     dev[83]
15850     dev[46]     dev[48]
15850     dev[46]     dev[56]
15850     dev[46]     dev[58]
6882      dev[58]     dev[47]
6882      dev[47]     dev[79]
6882      dev[47]     dev[52]
6882      dev[47]     dev[55]
7028      dev[46]     dev[99]
7028      dev[46]     dev[51]
7028      dev[46]     dev[57]
7597      dev[46]     dev[45]
7597      dev[46]     dev[72]
7597      dev[46]     dev[55]
7597      dev[46]     dev[58]
7597      dev[46]     dev[61]
7597      dev[46]     dev[64]
7597      dev[46]     dev[67]
7597      dev[46]     dev[70]
9859      dev[46]     dev[49]
9859      dev[46]     dev[53]
9859      dev[46]     dev[54]
9859      dev[46]     dev[59]
[Figure: collaboration network graph with developer nodes (dev[45] through dev[99]) linked by Project 6882, Project 7028, Project 7597, Project 9859, and Project 15850]
OSS Developer - Social Network
Developers are nodes / Projects are links
24 Developers, 5 Projects
2 Linchpin Developers, 1 Cluster
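The "developers are nodes / projects are links" idea can be sketched in a few lines of Python. This is only an illustration: the rows are a small subset in the style of the project/developer listing above, the function names are hypothetical, and counting developers who appear on more than one project is just one possible reading of "linchpin", which the slides do not define.

```python
from collections import defaultdict

# A few (project, developer, developer) rows in the style of the listing above.
rows = [
    (15850, 46, 83),
    (15850, 46, 48),
    (15850, 46, 58),
    (6882, 58, 47),
    (6882, 47, 79),
    (7028, 46, 99),
]

def build_network(rows):
    """Return (adjacency, membership): developer -> neighbours, developer -> projects."""
    adjacency = defaultdict(set)
    membership = defaultdict(set)
    for project, dev_a, dev_b in rows:
        adjacency[dev_a].add(dev_b)
        adjacency[dev_b].add(dev_a)
        membership[dev_a].add(project)
        membership[dev_b].add(project)
    return dict(adjacency), dict(membership)

def count_clusters(adjacency):
    """Count connected components with an iterative depth-first search."""
    seen = set()
    clusters = 0
    for start in adjacency:
        if start in seen:
            continue
        clusters += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return clusters

adjacency, membership = build_network(rows)
# Assumption: treat developers who sit on more than one project as bridge candidates.
multi_project = sorted(dev for dev, projects in membership.items() if len(projects) > 1)

print(len(adjacency), "developers,", count_clusters(adjacency), "cluster(s)")
print("developers on several projects:", multi_project)
```

On the full SourceForge.net data the same structure would be built from the project membership tables rather than a hand-typed list.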
Scale free distribution: developer participation
# projects   # of developers on that many projects
1            21488
2            3688
3            1086
4            413
5            177
6            76
7            35
8            21
9            9
10           6
11           5
12           6
15           1
16           1
17           1
y = 10.6905 - 3.70892 x
R² = 0.979906
[Figure: log-log plot, x axis Log(# of Projects), y axis Log(# of Developers)]
Scale Free – Power Law (developers)
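A straight-line fit in log-log space, like the one reported on this slide, can be sketched as below. The counts are the ones listed on the slide; the use of natural logarithms and of all 15 points is an assumption, so the fitted coefficients come out close to the slide's values but need not match them exactly.

```python
import math

# (number of projects, number of developers on that many projects), from the slide.
counts = [
    (1, 21488), (2, 3688), (3, 1086), (4, 413), (5, 177),
    (6, 76), (7, 35), (8, 21), (9, 9), (10, 6),
    (11, 5), (12, 6), (15, 1), (16, 1), (17, 1),
]

def fit_log_log(pairs):
    """Ordinary least squares of log(y) on log(x); returns (intercept, slope, r_squared)."""
    xs = [math.log(k) for k, _ in pairs]   # assumption: natural logarithm
    ys = [math.log(n) for _, n in pairs]
    size = len(xs)
    mean_x = sum(xs) / size
    mean_y = sum(ys) / size
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

intercept, slope, r_squared = fit_log_log(counts)
print(f"y = {intercept:.4f} + ({slope:.4f}) x, R^2 = {r_squared:.4f}")
```

A clearly negative slope with R² near 1 is what the slide's "scale free" label refers to: the number of developers drops off roughly as a power of the number of projects they join.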
Scale free distribution: project sizes
Scale Free – Power Law (projects)
Background (DM)
• Characteristics of data set
  • Incomplete, noisy, redundant
  • Complex structures, unstructured
  • Heterogeneous
  • Database not designed for research, but to support project management services of SourceForge.net
  • Temporal data is available, but not everything a researcher would want
  • Inferring or discovering temporal data is a potentially valuable opportunity
• What is DM (Data mining)
  • Nontrivial extraction of implicit, previously unknown and potentially useful information from data.
Data Mining Procedure
Raw data
Relevant data
Feature selection
Algorithm application
Result Evaluation
Data Integration
Data Pre-processing
Database
Spatial-temporal DM (1)
• Temporal data mining
  • Discover the behavior-based knowledge instead of state-based knowledge.
  • Example: many wolves -> fewer rabbits
  • Relationship between timely feedback and quality of software/success of the OSS project
Spatio-temporal DM
• New research domain: Spatio-temporal data mining
  • Growing interest in spatio-temporal data mining
    • Recommender systems
    • Location based services
    • Time based services
    • GIS applications
  • Extension of classic data mining techniques to data sets with spatial and temporal properties.
  • Challenges: complexity of spatial information and difficulty in reasoning about temporal information, e.g.,
    • Intervals
    • Points
    • Hybrids
Motivations
• Limitations of OSS research to date
  • Mostly feature based data mining to date
  • Neglect of the inherent spatial and temporal information in the OSS community
• SourceForge.net properties
  • Spatial information
    • Collaboration network
  • Temporal information
    • History data and log tables
Spatial information in OSS?
• The collaboration network in SF
  • Study of the topology of the collaboration network.
  • The network can be mapped as a graph
    • This graph is a non-metric space
  • Spread of ideas (software engineering tools and practices, new project opportunities)
Temporal information in OSS
• The network is evolving, and the histories of the site and individual entities comprise the temporal information in the network.
• Discrete time points
  • All the statistics are collected periodically.
• Partially ordered events
  • Multiple timelines exist in the system
[Figure: events a, b, c, d; label "?"]
ST Mining
• Different from classic data mining
  • Spatial and temporal relationships are complicated
    • Metric and non-metric spatial relations
    • Temporal relations
  • Intrinsic dependency and heterogeneity
  • Scale effect in space and time
• Significant modification of many data mining techniques is needed.
Problem definition I
• Dependency analysis
  • Extension of associations to ST mining
• Complicated associations
  • Vertical (temporal) and horizontal (spatial) associations
  • Combination of vertical and horizontal associations
  • Examples: lag effects between projects
• Flexible associations
  • The huge volume and scale effect of spatial-temporal data sets introduce noise and error
  • Strict association is difficult to define
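One simple way to look for a lag effect between two projects is a lagged correlation. The sketch below is an illustration under stated assumptions, not the method of the study: the two monthly series are invented, `pearson` and `best_lag` are hypothetical helper names, and correlation is only one possible measure of a temporal association.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def best_lag(leader, follower, max_lag):
    """Return (lag, r) maximizing the correlation of leader[t] with follower[t + lag]."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        r = pearson(leader[:-lag], follower[lag:])
        if r > best[1]:
            best = (lag, r)
    return best

# Hypothetical monthly activity counts; project_b repeats project_a two months later.
project_a = [10, 12, 30, 45, 20, 15, 40, 60, 25, 18, 22, 50]
project_b = [5, 6, 10, 12, 30, 45, 20, 15, 40, 60, 25, 18]

lag, r = best_lag(project_a, project_b, max_lag=4)
print(f"strongest association at a lag of {lag} month(s), r = {r:.3f}")
```

Because real SourceForge.net series are noisy, a tolerant ("flexible") threshold on r would be more realistic than expecting an exact match at a single lag.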
Problem definition II
• Topic of this study: prediction support
  • Clustering: group the projects with similar evolution.
  • Summarization: summarize the representative characteristics of different project evolution patterns
  • Prediction: predict the project evolution (based on the pattern discovered)
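The three steps (clustering, summarization, prediction) can be sketched on toy data. The slides do not name a clustering algorithm, so k-means is used here purely as a stand-in; the series are invented, the initialization simply takes the first k series, and a cluster's mean series plays the role of its summary.

```python
def normalize(series):
    """Scale a series to zero mean and unit variance so only its shape matters."""
    n = len(series)
    mean = sum(series) / n
    std = (sum((v - mean) ** 2 for v in series) / n) ** 0.5
    return [(v - mean) / std if std else 0.0 for v in series]

def distance(a, b):
    """Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(series_list, k, iterations=10):
    """Cluster normalized series; returns (labels, centroids)."""
    points = [normalize(s) for s in series_list]
    centroids = [points[i][:] for i in range(k)]  # assumption: first k series as seeds
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Summarization: the centroid is the mean shape of the cluster.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def assign(series, centroids):
    """Prediction support: match a new project to the closest evolution pattern."""
    point = normalize(series)
    return min(range(len(centroids)), key=lambda c: distance(point, centroids[c]))

# Hypothetical monthly developer counts for four projects.
histories = [
    [1, 2, 3, 5, 8, 12],    # growing
    [12, 9, 7, 4, 2, 1],    # declining
    [2, 3, 5, 6, 9, 14],    # growing
    [20, 15, 11, 6, 3, 2],  # declining
]
labels, centroids = kmeans(histories, k=2)
new_project_pattern = assign([1, 1, 2, 4, 7, 9], centroids)
print(labels, new_project_pattern)
```

Normalizing first means projects are grouped by the shape of their history rather than by their absolute size, which seems closer to "similar evolution" than raw counts would be.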
Research Data
• SourceForge.net database dump June 2005
  • 117 tables
  • Up to 30 million records per table
  • 23 Gigabytes
  • PostgreSQL
• Three types of tables
  • Data tables
  • History tables
  • Statistics tables
Methodology
• Project development statistics
  • Numerical statistics.
  • Expertise and survey statistics.
• Time series analysis
  • Generate the time series for these statistics
• Classification generation
  • ABN algorithm used
• Classifier evaluation
  • Evaluation by comparing the predicted class with the actual class
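Comparing predicted and actual classes usually comes down to an accuracy figure and a confusion table. The sketch below shows only that evaluation step; the class labels are invented for illustration, and the classifier itself (ABN in the slides) is not implemented here.

```python
from collections import Counter

def evaluate(actual, predicted):
    """Return (accuracy, confusion) where confusion counts (actual, predicted) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion

# Hypothetical class labels for six projects.
actual = ["growing", "growing", "stable", "declining", "stable", "declining"]
predicted = ["growing", "stable", "stable", "declining", "stable", "growing"]

accuracy, confusion = evaluate(actual, predicted)
print(f"accuracy = {accuracy:.2f}")
for (a, p), count in sorted(confusion.items()):
    print(f"actual={a:<10} predicted={p:<10} count={count}")
```

The confusion counts show which evolution patterns get mistaken for which, which a single accuracy number hides.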
Numerical statistics
• Statistics tables have the information about project history
  • Stats_project_months
  • Every record stands for a monthly history of a single project
  • Records from November 1999 to June 2005
• There are 24 attributes in every record
  • Descriptive attributes (3)
  • Statistics (numeric) attributes (21)
• We use the statistics attributes
Statistics Attributes
Attributes         Attributes
Developers         Patches_opened
Downloads          Patches_closed
Subdomain_Views    Artifacts_opened
Page_views         Artifacts_closed
File_releases      Tasks_opened
Msg_posted         Tasks_closed
Bug_opened         Help_requests
Bug_closed         CVS_checkouts
Support_opened     CVS_commits
Site_views         CVS_adds
Support_closed
Expertise statistics
• Rating scores
  • Expertise rating
  • User rating
• Importance parameter
  • Domain importance
• Contribution parameter
Time Series
• Time series are used to describe the history of each attribute.
  • Time series: an ordered sequence of values of a variable at equally spaced time intervals.
  • The available monthly values of each statistic are used to generate the time series.
• Goal is to study the project history patterns.
  • Description
  • Prediction
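Turning monthly records into an equally spaced series might look like the sketch below. The record layout (a dict with `project`, `month`, and one statistic) is a hypothetical simplification of a monthly statistics table, and filling months that have no record with 0 is an assumption; another fill policy may suit the real data better.

```python
def month_index(month):
    """Convert 'YYYY-MM' to a running month number so gaps can be detected."""
    year, mon = month.split("-")
    return int(year) * 12 + int(mon) - 1

def monthly_series(records, project_id, attribute, fill=0):
    """Build an equally spaced monthly series for one project and one statistic."""
    values = {
        month_index(r["month"]): r[attribute]
        for r in records
        if r["project"] == project_id
    }
    if not values:
        return []
    first, last = min(values), max(values)
    # Assumption: months without a record are filled with `fill`.
    return [values.get(m, fill) for m in range(first, last + 1)]

# Hypothetical records; field names are illustrative only.
records = [
    {"project": 7597, "month": "2004-11", "downloads": 120},
    {"project": 7597, "month": "2005-01", "downloads": 90},
    {"project": 7597, "month": "2005-02", "downloads": 150},
    {"project": 9859, "month": "2005-01", "downloads": 7},
]

series = monthly_series(records, 7597, "downloads")
print(series)  # [120, 0, 90, 150]
```

Keeping the spacing regular matters because the definition above requires equally spaced intervals; a missing month would otherwise silently shift every later value.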
Conclusion
• Project prediction using ST mining
  • We used statistics to predict the project development
  • Calibration using new data is important to keep the prediction valid.
Questions