Analysis of the Open Source Software development community using ST mining: A Research Plan

Yongqin Gao, Greg Madey, Computer Science & Engineering, University of Notre Dame. NAACSOS Conference, Notre Dame, IN, June 26-28, 2005. Supported in part by the National Science Foundation – ISS/Digital Science & Technology.


Page 1: Analysis of the Open Source Software development community using ST mining: A Research Plan

Analysis of the Open Source Software development community using ST mining: A Research Plan

Yongqin Gao, Greg Madey
Computer Science & Engineering
University of Notre Dame

NAACSOS Conference
Notre Dame, IN
June 26-28, 2005

Supported in part by the National Science Foundation – ISS/Digital Science & Technology

Page 2: Analysis of the Open Source Software development community using ST mining: A Research Plan

Outline

• Background
• Motivation
• Problem definition
• Research data
• Methodology
• Conclusion

Page 3: Analysis of the Open Source Software development community using ST mining: A Research Plan

Background (OSS)

• What is OSS?
  • Free to use, modify and distribute
  • Source code available and modifiable
• Potential advantages over commercial software
  • Transparent and easy adoption
  • Fast development
  • Low cost
  • Potential high quality
• Why study OSS?
  • Software engineering: new development and coordination methods
  • Open content: model for other forms of open, shared collaboration
  • Complexity: successful example of self-organization/emergence
  • Growing popularity
  • Non-traditional governance and project management practices
  • Virtual --> Data!

Page 4: Analysis of the Open Source Software development community using ST mining: A Research Plan

Open Source Software (OSS)

• Free …
  • to view source
  • to modify
  • to share
  • of cost
• Examples
  • Apache
  • Perl
  • GNU
  • Linux
  • Sendmail
  • Python
  • KDE
  • GNOME
  • Mozilla
  • Thousands more

[Images: Linux, GNU, Savannah]

Page 5: Analysis of the Open Source Software development community using ST mining: A Research Plan

Leaders

• Linus Torvalds: Linux
• Larry Wall: Perl
• Richard Stallman: GNU Manifesto
• Eric Raymond: The Cathedral and the Bazaar

Page 6: Analysis of the Open Source Software development community using ST mining: A Research Plan

Success of Apache

• Almost 70% Market Share (Netcraft.com)

Page 7: Analysis of the Open Source Software development community using ST mining: A Research Plan

Research Approach

• Understanding the social and task dynamics that predict developer behaviors
• Social network analysis: longitudinal study of preferential attachment and dynamic attachment
• Conceptual explanatory model of OSS: agent-based modeling and simulation
• Opportunity: huge amounts of relatively good data

[Diagram: the components above are connected by links labelled Parameter Values, Structural Features, Cross Validation and Combined Data Mining]

Page 8: Analysis of the Open Source Software development community using ST mining: A Research Plan

SourceForge.net

• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 100 K Projects
• 100 K Developers
• 1 M Registered Users

Page 9: Analysis of the Open Source Software development community using ST mining: A Research Plan

150 GBytes of Data & Growing

Page 10: Analysis of the Open Source Software development community using ST mining: A Research Plan

OSS Developer - Social Network
Developers are nodes / Projects are links

• 24 Developers
• 5 Projects
• 2 Linchpin Developers
• 1 Cluster

Edge list shown on the slide (project ID, developer, developer):

15850  dev[46]  dev[83]
15850  dev[46]  dev[48]
15850  dev[46]  dev[56]
15850  dev[46]  dev[58]
6882   dev[58]  dev[47]
6882   dev[47]  dev[79]
6882   dev[47]  dev[52]
6882   dev[47]  dev[55]
7028   dev[46]  dev[99]
7028   dev[46]  dev[51]
7028   dev[46]  dev[57]
7597   dev[46]  dev[45]
7597   dev[46]  dev[72]
7597   dev[46]  dev[55]
7597   dev[46]  dev[58]
7597   dev[46]  dev[61]
7597   dev[46]  dev[64]
7597   dev[46]  dev[67]
7597   dev[46]  dev[70]
9859   dev[46]  dev[49]
9859   dev[46]  dev[53]
9859   dev[46]  dev[54]
9859   dev[46]  dev[59]

[Figure: network graph of the developer nodes, grouped into Project 6882, Project 9859, Project 7597, Project 7028 and Project 15850]
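To make the "developers are nodes / projects are links" idea concrete, here is a minimal sketch that rebuilds project membership from the edge list on this slide. It assumes the transcribed list is accurate and treats "developer on more than one project" as one possible reading of "linchpin". The slide does not define the term, and the transcribed list may be incomplete, so the counts need not match the slide's summary.

```python
from collections import defaultdict

# Edge list as transcribed from the slide: (project ID, developer, developer).
EDGES = [
    (15850, 46, 83), (15850, 46, 48), (15850, 46, 56), (15850, 46, 58),
    (6882, 58, 47), (6882, 47, 79), (6882, 47, 52), (6882, 47, 55),
    (7028, 46, 99), (7028, 46, 51), (7028, 46, 57),
    (7597, 46, 45), (7597, 46, 72), (7597, 46, 55), (7597, 46, 58),
    (7597, 46, 61), (7597, 46, 64), (7597, 46, 67), (7597, 46, 70),
    (9859, 46, 49), (9859, 46, 53), (9859, 46, 54), (9859, 46, 59),
]


def projects_by_developer(edges):
    """Map each developer ID to the set of projects they appear in."""
    membership = defaultdict(set)
    for project, dev_a, dev_b in edges:
        membership[dev_a].add(project)
        membership[dev_b].add(project)
    return dict(membership)


def multi_project_developers(edges):
    """Developers on more than one project (an assumed reading of 'linchpin')."""
    membership = projects_by_developer(edges)
    return sorted(dev for dev, projects in membership.items() if len(projects) > 1)


membership = projects_by_developer(EDGES)
bridging = multi_project_developers(EDGES)
print(len(membership), "developers; on several projects:", bridging)
```

Other definitions of a linchpin, such as a node whose removal disconnects the graph, would give a different answer on the same data.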

Page 11: Analysis of the Open Source Software development community using ST mining: A Research Plan

Scale free distribution: developer participation

# projects   # of developers on that many projects
1            21488
2            3688
3            1086
4            413
5            177
6            76
7            35
8            21
9            9
10           6
11           5
12           6
15           1
16           1
17           1

Fitted line: y = 10.6905 - 3.70892 x, R² = 0.979906

[Figure: log-log plot with Log(# of Projects) on the x-axis and Log(# of Developers) on the y-axis]

Scale Free – Power Law (developers)
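The straight-line fit on the log-log plot can be approximated with ordinary least squares on the table from this slide. The sketch below assumes natural logarithms and an unweighted fit over all fifteen rows. The slide does not say how its line was obtained, so the coefficients need not match it exactly.

```python
import math

# (# projects, # of developers on that many projects), from the slide's table.
PARTICIPATION = [
    (1, 21488), (2, 3688), (3, 1086), (4, 413), (5, 177), (6, 76), (7, 35),
    (8, 21), (9, 9), (10, 6), (11, 5), (12, 6), (15, 1), (16, 1), (17, 1),
]


def fit_log_log(points):
    """Least squares of ln(count) on ln(k); returns (intercept, slope, r2)."""
    xs = [math.log(k) for k, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return intercept, slope, r2


intercept, slope, r2 = fit_log_log(PARTICIPATION)
print(f"ln(count) = {intercept:.3f} + ({slope:.3f}) * ln(k), R^2 = {r2:.3f}")
```

A negative slope with a high R² on log-log axes is what the slide reads as a power law. A least-squares line on log-transformed counts is only a rough check, not a formal test of the distribution.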

Page 12: Analysis of the Open Source Software development community using ST mining: A Research Plan

Scale free distribution: project sizes

Scale Free – Power Law (projects)

Page 13: Analysis of the Open Source Software development community using ST mining: A Research Plan

Background (DM)

• Characteristics of data set
  • Incomplete, noisy, redundant
  • Complex structures, unstructured
  • Heterogeneous
  • Database not designed for research, but to support project management services of SourceForge.net
  • Temporal data is available, but not everything a researcher would want
  • Inferencing/discovery of temporal data potentially valuable opportunity
• What is DM (Data mining)
  • Nontrivial extraction of implicit, previously unknown and potentially useful information from data.

Page 14: Analysis of the Open Source Software development community using ST mining: A Research Plan

Data Mining Procedure

Raw data

Relevant data

Feature selection

Algorithm application

Result Evaluation

Data Integration

Data Pre-processing

Database

Page 15: Analysis of the Open Source Software development community using ST mining: A Research Plan

Spatial-temporal DM (1)

• Temporal data mining
  • Discover the behavior-based knowledge instead of state-based knowledge.
  • Example: many wolves -> fewer rabbits
  • Relationship between timely feedback and quality of software/success of the OSS project
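One simple way to look for behavior-based knowledge of the "many wolves -> fewer rabbits" kind is a lagged correlation: compare one series with a later slice of another. The sketch below is an illustration only. The counts are made up, and the slides do not name a specific technique for this step.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(cause, effect, lag):
    """Correlate cause[t] with effect[t + lag] for a non-negative lag."""
    if lag == 0:
        return pearson(cause, effect)
    return pearson(cause[:-lag], effect[lag:])


# Made-up counts: each rabbit value is 60 - 5 * (wolves one step earlier).
wolves = [2, 4, 8, 6, 3, 2, 5, 9]
rabbits = [45, 50, 40, 20, 30, 45, 50, 35]

corr_same_time = lagged_correlation(wolves, rabbits, 0)
corr_one_step = lagged_correlation(wolves, rabbits, 1)
print(round(corr_same_time, 3), round(corr_one_step, 3))
```

The same comparison could be applied to, say, a feedback-delay series and a later project-success measure, provided both are sampled at the same interval.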

Page 16: Analysis of the Open Source Software development community using ST mining: A Research Plan

Spatio-temporal DM

• New research domain: spatio-temporal data mining
• Growing interest in spatio-temporal data mining
  • Recommender systems
  • Location based services
  • Time based services
  • GIS applications
• Extension of classic data mining techniques to data sets with spatial and temporal properties.
• Challenges: complexity of spatial information and difficulty in reasoning about temporal information, e.g.,
  • Intervals
  • Points
  • Hybrids

Page 17: Analysis of the Open Source Software development community using ST mining: A Research Plan

Motivations

• Limitations of OSS research to date
  • Mostly feature based data mining to date
  • Neglect of the inherent spatial and temporal information in the OSS community
• SourceForge.net properties
  • Spatial information
    • Collaboration network
  • Temporal information
    • History data and log tables

Page 18: Analysis of the Open Source Software development community using ST mining: A Research Plan

Spatial information in OSS?

• The collaboration network in SF
  • Study of the topology of the collaboration network.
  • The network can be mapped as a graph
    • This graph is a non-metric space
  • Spread of ideas (software engineering tools and practices, new project opportunities)
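Because the collaboration network has no coordinates, one way to talk about "where" a developer sits is the number of collaboration links separating two developers. The sketch below computes those hop counts with a breadth-first search. The developer labels and links are hypothetical, and using hop count as the notion of proximity is an assumption, not something the slides specify.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from (developer, developer) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_distances(graph, source):
    """Breadth-first search: links from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical links: A..E are connected, F and G form a separate group.
links = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E"), ("F", "G")]
graph = build_graph(links)
from_a = hop_distances(graph, "A")
print(from_a)
```

Developers missing from the result are unreachable from the source, which is one way an idea could fail to spread between groups.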

Page 19: Analysis of the Open Source Software development community using ST mining: A Research Plan

Temporal information in OSS

• The network is evolving, and the histories of the site and individual entities comprise the temporal information in the network.
  • Discrete time points
    • All the statistics are collected periodically.
  • Partially ordered events
    • Multiple timelines exist in the system

[Figure: events a, b, c and d, with one ordering marked "?"]

Page 20: Analysis of the Open Source Software development community using ST mining: A Research Plan

ST Mining

• Different from classic data mining
  • Spatial and temporal relationships are complicated
    • Metric and non-metric spatial relations
    • Temporal relations
  • Intrinsic dependency and heterogeneity
  • Scale effect in space and time
• Significant modification of many data mining techniques is needed.

Page 21: Analysis of the Open Source Software development community using ST mining: A Research Plan

Problem definition I

• Dependency analysis
  • Extension of associations to ST mining
• Complicated associations
  • Vertical (temporal) and horizontal (spatial) associations
  • Combination of vertical and horizontal associations
  • Examples: lag effects between projects
• Flexible associations
  • Huge volume and scale effect of spatial-temporal data set introduce noise and error
  • Strict association is difficult to define
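A combined vertical and horizontal association can be illustrated as "an event in one project is followed, within some number of months, by an event in a linked project". The sketch below counts such pairs and shows how widening the allowed lag turns a strict rule into a more flexible one. The event types, project labels, link direction and the `max_lag` parameter are all hypothetical choices made for illustration.

```python
def lagged_association(events_a, events_b, links, max_lag):
    """Count A-events followed by a B-event in a linked project.

    events_a, events_b: sets of (project, month) pairs.
    links: set of (project, neighbour) pairs, treated as directed here.
    A match needs a B-event in a neighbour 1..max_lag months later.
    Returns (matched, considered, confidence).
    """
    matched = 0
    considered = 0
    for project, month in sorted(events_a):
        neighbours = [q for p, q in links if p == project]
        if not neighbours:
            continue  # no horizontal (spatial) relation to follow
        considered += 1
        if any((q, month + lag) in events_b
               for q in neighbours
               for lag in range(1, max_lag + 1)):
            matched += 1
    confidence = matched / considered if considered else 0.0
    return matched, considered, confidence


# Hypothetical data: months are integer indices, projects are P1..P3.
releases = {("P1", 1), ("P1", 5), ("P2", 3)}
new_developers = {("P2", 2), ("P2", 8), ("P3", 4)}
links = {("P1", "P2"), ("P2", "P3")}

strict = lagged_association(releases, new_developers, links, max_lag=1)
flexible = lagged_association(releases, new_developers, links, max_lag=3)
print(strict, flexible)
```

With a one-month window, two of the three releases are followed by new developers in a linked project. A three-month window matches all three, at the cost of a looser rule.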

Page 22: Analysis of the Open Source Software development community using ST mining: A Research Plan

Problem definition II

• Topic of this study: prediction support
  • Clustering: group the projects with similar evolution.
  • Summarization: summarize the representative characteristics of different project evolution patterns
  • Prediction: predict the project evolution (based on the pattern discovered)
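The three steps can be sketched on toy data. The example below uses plain k-means as a stand-in, since this slide does not name a clustering algorithm. The project names, the monthly counts and the choice of starting centroids are all hypothetical.

```python
def distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_series(series_list):
    """Point-wise mean of several equal-length series."""
    return [sum(values) / len(values) for values in zip(*series_list)]


def nearest(series, centroids):
    """Index of the centroid closest to the series."""
    return min(range(len(centroids)),
               key=lambda i: distance(series, centroids[i]))


def kmeans(series_list, initial_centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        labels = [nearest(s, centroids) for s in series_list]
        for i in range(len(centroids)):
            members = [s for s, lab in zip(series_list, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = mean_series(members)
    return labels, centroids


# Hypothetical monthly developer counts for four projects.
projects = {
    "growing_1": [1, 2, 4, 6, 9],
    "growing_2": [1, 3, 5, 7, 8],
    "fading_1": [6, 5, 3, 2, 1],
    "fading_2": [7, 5, 4, 2, 1],
}
series = list(projects.values())

# Clustering, seeded with the first and last series.
labels, centroids = kmeans(series, [series[0], series[-1]])
# Summarization: each centroid is the average evolution of its cluster.
# Prediction: assign a new project to the closest discovered pattern.
new_project_label = nearest([2, 3, 4, 6, 8], centroids)
print(labels, centroids, new_project_label)
```

Here the centroid plays the role of the "representative characteristics" of a pattern, and prediction is reduced to nearest-pattern assignment.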

Page 23: Analysis of the Open Source Software development community using ST mining: A Research Plan

Research Data

• SourceForge.net database dump June 2005
  • 117 tables
  • Records up to 30 million per table
  • 23 Gigabytes
  • PostgreSQL
• Three types of tables
  • Data tables
  • History tables
  • Statistics tables

Page 24: Analysis of the Open Source Software development community using ST mining: A Research Plan

Methodology

• Project development statistics
  • Numerical statistics.
  • Expertise and survey statistics.
• Time series analysis
  • Generate the time series for these statistics
• Classification generation
  • ABN algorithm used
• Classifier evaluation
  • Evaluation by comparing the predicted class with the actual class
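Comparing predicted and actual classes usually comes down to an accuracy figure and a confusion table. The sketch below shows that comparison on made-up labels. The class names are hypothetical, and the classifier itself (ABN on the slide) is not reproduced here.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy, confusion); confusion counts (actual, predicted) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion


# Hypothetical class labels for six projects.
actual = ["growing", "growing", "stable", "fading", "fading", "stable"]
predicted = ["growing", "stable", "stable", "fading", "growing", "stable"]

accuracy, confusion = evaluate(actual, predicted)
print(f"accuracy = {accuracy:.2f}")
for (a, p), count in sorted(confusion.items()):
    print(f"actual={a:8s} predicted={p:8s} count={count}")
```

The confusion counts show which classes get mistaken for each other, which a single accuracy number hides.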

Page 25: Analysis of the Open Source Software development community using ST mining: A Research Plan

Numerical statistics

• Statistics tables have the information about project history
  • Stats_project_months
  • Every record stands for a monthly history of a single project
  • Records from November 1999 to June 2005
• There are 24 attributes in every record
  • Descriptive attributes (3)
  • Statistics (numeric) attributes (21)
• We use the statistics attributes

Page 26: Analysis of the Open Source Software development community using ST mining: A Research Plan

Statistics Attributes

Attributes         Attributes
Developers         Patches_opened
Downloads          Patches_closed
Subdomain_Views    Artifacts_opened
Page_views         Artifacts_closed
File_releases      Tasks_opened
Msg_posted         Tasks_closed
Bug_opened         Help_requests
Bug_closed         CVS_checkouts
Support_opened     CVS_commits
Site_views         CVS_adds
Support_closed

Page 27: Analysis of the Open Source Software development community using ST mining: A Research Plan

Expertise statistics

• Rating scores
  • Expertise rating
  • User rating
• Importance parameter
  • Domain importance
• Contribution parameter

Page 28: Analysis of the Open Source Software development community using ST mining: A Research Plan

Time Series

• Time series used to describe the history of each attribute.
  • Time series: an ordered sequence of values of a variable at equally spaced time intervals.
  • The available monthly values of each statistic are used to generate the time series.
• Goal is to study the project history patterns.
  • Description
  • Prediction
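Turning monthly records into an equally spaced series mostly means ordering by month and making gaps explicit. The sketch below does this for one project and one attribute. The field names (`project`, `year`, `month`, `downloads`) and the sample rows are assumptions for illustration, not the real table schema.

```python
def month_range(start, end):
    """All (year, month) pairs from start to end, inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def build_series(records, project, attribute, start, end):
    """Equally spaced monthly series for one project and one attribute.

    Months with no record are filled with None so the spacing stays regular.
    """
    by_month = {
        (r["year"], r["month"]): r[attribute]
        for r in records
        if r["project"] == project
    }
    return [by_month.get(m) for m in month_range(start, end)]


# Hypothetical rows; the field names are assumptions, not the real schema.
records = [
    {"project": 7597, "year": 2004, "month": 11, "downloads": 120},
    {"project": 7597, "year": 2004, "month": 12, "downloads": 150},
    {"project": 7597, "year": 2005, "month": 2, "downloads": 90},
    {"project": 6882, "year": 2005, "month": 1, "downloads": 10},
]

series = build_series(records, 7597, "downloads", (2004, 11), (2005, 3))
print(series)
```

Whether a missing month should stay `None`, become zero, or be interpolated depends on what the absence of a record means in the source table, which the slides do not say.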

Page 29: Analysis of the Open Source Software development community using ST mining: A Research Plan

Conclusion

• Project prediction using ST mining
  • We used statistics to predict the project development
  • Calibration using new data is important to keep the prediction valid.

Page 30: Analysis of the Open Source Software development community using ST mining: A Research Plan

Questions