Mining Software Archives to Support Software Development
Job application talk.
Tom Zimmermann
Saarland University
Software Development
Hello Calgary!
Build
Collaboration
Comm. Archive
Bug Database
Version Archive
Mining Software Archives
eROSE BugCache Vulture
eROSE: Related Changes
(ICSE 2004, TSE 2005)
Tom Zimmermann • Saarland University
Peter Weißgerber • University of Trier
Stephan Diehl • University of Trier
Andreas Zeller • Saarland University
Developers who changed this function also changed...
eROSE: Guiding Developers
Purchase History
Customers who bought this item also bought...
Version Archive
Developers who changed this function also changed...
eROSE suggests further locations.
eROSE prevents incomplete changes.
Processing CVS data
1. Comparing files
2. Building transactions
Comparing Files
A(), C(), E(), D(), B()
A(), B(), E(), F(), D()
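The comparison step can be illustrated with a minimal sketch. It assumes each revision has already been parsed into a list of entity names, and that the first list on the slide is the earlier revision; both assumptions, and the helper name `diff_entities`, are illustrative and not taken from the talk.

```python
def diff_entities(old, new):
    """Return (added, removed) entity names between two revisions of a file."""
    old_set, new_set = set(old), set(new)
    added = sorted(new_set - old_set)      # present only in the newer revision
    removed = sorted(old_set - new_set)    # present only in the older revision
    return added, removed


# Entity lists from the slide; which one is older is an assumption.
old_revision = ["A()", "C()", "E()", "D()", "B()"]
new_revision = ["A()", "B()", "E()", "F()", "D()"]

added, removed = diff_entities(old_revision, new_revision)
print("added:", added)      # added: ['F()']
print("removed:", removed)  # removed: ['C()']
```

This only detects added and removed entities. Detecting entities whose bodies changed would additionally require comparing their source text, which is omitted here.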
Building Transactions
CVS
150,000
createGeneralPage()
createTextComparePage()
fKeys[]
initDefaults()
buildnotes_compare.html
PatchMessages.properties
plugin.properties
2003-02-19 (aweinand): fixed #13332
same author + message + time
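The "same author + message + time" heuristic could be sketched as follows. The tuple layout, the function name `build_transactions`, and the 200-second sliding window are assumptions made for illustration; the slide only states that author, message, and time are used.

```python
from datetime import datetime, timedelta


def build_transactions(checkins, window=timedelta(seconds=200)):
    """Group per-file check-ins into transactions.

    checkins: iterable of (author, message, timestamp, entity) tuples.
    A check-in joins the latest transaction with the same author and message
    if it follows that transaction's previous check-in within `window`.
    """
    transactions = []
    latest = {}  # (author, message) -> most recent transaction for that key
    for author, message, ts, entity in sorted(checkins, key=lambda c: c[2]):
        key = (author, message)
        tx = latest.get(key)
        if tx is None or ts - tx["end"] > window:
            tx = {"author": author, "message": message, "end": ts, "entities": []}
            transactions.append(tx)
            latest[key] = tx
        tx["entities"].append(entity)
        tx["end"] = ts  # sliding window: measured from the last check-in
    return transactions


# Hypothetical timestamps; author, message, and entity names are from the slide.
t0 = datetime(2003, 2, 19, 12, 0, 0)
checkins = [
    ("aweinand", "fixed #13332", t0, "fKeys[]"),
    ("aweinand", "fixed #13332", t0 + timedelta(seconds=20), "initDefaults()"),
    ("aweinand", "fixed #13332", t0 + timedelta(seconds=45), "plugin.properties"),
]
for tx in build_transactions(checkins):
    print(tx["author"], tx["message"], tx["entities"])
```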
Mining Associations
User changes fKeys[] and initDefaults()
eROSE finds past transactions
#756: fKeys[], initDefaults(), ..., plugin.properties
#6721: fKeys[], initDefaults(), ..., plugin.properties
#21078: fKeys[], initDefaults(), ..., plugin.properties
#42432: fKeys[], initDefaults(), ..., plugin.properties
#51345: fKeys[], initDefaults(), ..., plugin.properties
#59998: fKeys[], initDefaults(), ..., plugin.properties
#71003: fKeys[], initDefaults(), ..., plugin.properties
#87264: fKeys[], initDefaults(), ...
#91220: fKeys[], initDefaults(), ..., plugin.properties
#101823: fKeys[], initDefaults(), ..., plugin.properties
#104223: fKeys[], initDefaults(), ..., plugin.properties
{fKeys[], initDefaults()} ⇒ {plugin.properties}
Support 10, Confidence 10/11 = 0.909
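The support and confidence figures for this rule can be reproduced with a short sketch. The function name `rule_stats` and the representation of transactions as sets are assumptions; the eleven transactions mirror the slide, with the elided entities ("...") left out.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support: number of transactions containing antecedent and consequent.
    confidence: support divided by the number of transactions that contain
    the antecedent.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [t for t in transactions if antecedent <= set(t)]
    support = sum(1 for t in matching if consequent <= set(t))
    confidence = support / len(matching) if matching else 0.0
    return support, confidence


# Ten past transactions changed plugin.properties as well, one did not.
past = [{"fKeys[]", "initDefaults()", "plugin.properties"} for _ in range(10)]
past.append({"fKeys[]", "initDefaults()"})

support, confidence = rule_stats(
    past, {"fKeys[]", "initDefaults()"}, {"plugin.properties"}
)
print(support, round(confidence, 3))  # 10 0.909
```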
Evaluation
PostgreSQL, jEdit, KOffice, GIMP
eROSE predicts 33% of all changed entities (files: 44%).
In 70% of all transactions, eROSE’s topmost three suggestions contain a changed entity (files: 72%).
eROSE learns quickly (within 30 days).
eROSE: Related Changes
(ICSE 2004, TSE 2005)
non-program elements (documentation)
learns quickly
guides developers
BugCache: Predicting Defects
(ASE 2006, ICSE 2007)
Sung Kim • MIT
Tom Zimmermann • Saarland University
Jim Whitehead • Univ. of California SC
Andreas Zeller • Saarland University
The Problem
How should we allocate our resources for quality assurance?
One Solution
List with elements that (will) have defects
List is adaptive, i.e., it changes over time
Cache
The BugCache Model
Cache size: 2
Hypothesis: Temporal locality between defects
What is loaded in the cache?
Miss, Hit, Miss
Hit rate = #Hits / #Defects = 33.3%
Miss, Hit, Miss, Miss
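The hit-rate computation can be illustrated with a small simulation. This is a sketch under stated assumptions: the least-recently-used replacement policy, the function name `simulate_cache`, and the file names are all hypothetical, since the slides do not say which entities were involved or how entries are evicted.

```python
from collections import OrderedDict


def simulate_cache(defects, cache_size=2):
    """Replay a sequence of defective entities against a fixed-size cache.

    An entity already in the cache counts as a hit; otherwise it is a miss
    and the entity is loaded (temporal locality). Eviction is LRU, which is
    an assumption of this sketch.
    """
    cache = OrderedDict()
    outcomes = []
    for entity in defects:
        if entity in cache:
            outcomes.append("Hit")
            cache.move_to_end(entity)      # mark as most recently used
        else:
            outcomes.append("Miss")
            cache[entity] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return outcomes


# Hypothetical defect sequence that yields the outcomes shown on the slide.
defects = ["a.c", "a.c", "b.c"]
outcomes = simulate_cache(defects)
hit_rate = outcomes.count("Hit") / len(outcomes)
print(outcomes, f"{hit_rate:.1%}")  # ['Miss', 'Hit', 'Miss'] 33.3%
```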
Loading Elements
Temporal locality – as shown before
Spatial locality – load “nearby” elements (i.e., co-changed before)
Changed-entity locality – load changed elements
New-entity locality – load new elements
Initial pre-fetch – start with a loaded cache
Evaluation
PostgreSQL, jEdit, Mozilla, Columba
Hit Rates (cache size = 10%)
              Methods               Files
Project       BugCache  FixCache    BugCache  FixCache
Apache 1.3    59.6%     61.5%       83.9%     81.5%
Columba       58.9%     67.6%       83.5%     83.0%
Eclipse       64.5%     71.6%       95.1%     95.0%
JEdit         50.5%     48.9%       85.7%     85.4%
Mozilla       49.3%     55.0%       93.3%     88.0%
PostgreSQL    61.9%     59.2%       73.9%     71.0%
Subversion    68.3%     43.8%       82.0%     81.3%
Reasons for Hits
Spatial locality: 18%
Temporal locality: 60%
Initial pre-fetch: 18%
Chart legend: Initial pre-fetch, Temporal locality, Spatial locality, Changed-entity locality, New-entity locality
Warning Developers
“Safe” Location (not in FixCache)
Risky Location (red, in FixCache)
BugCache: Predicting Defects
(ASE 2006, ICSE 2007)
adaptive
hit rates of 71% to 95%
temporal locality
Vulture: Predicting Security Vulnerabilities
(Work in Progress)
Stephan Neuhaus • Saarland University
Tom Zimmermann • Saarland University
Andreas Zeller • Saarland University
Firefox/Mozilla
14,368 C/C++ files (10,452 components)
1,012,512 revisions
228,365 commits
>700 developers
Vulnerabilities
Security Advisory 2005-12
Title: Livefeed bookmarks can steal cookies
Impact: High
Products: Firefox
Description: Earlier versions of Firefox allowed javascript: and data: URLs as Livefeed bookmarks. When they updated the URL would be run in the context of the current page and could be used to steal cookies or data displayed on the page. If the user were on a page with elevated privileges (for example, about:config) when the Livefeed was updated, the feed URL could potentially run arbitrary code on the user's machine.
Security Advisory 2005-13
Title: Window Injection Spoofing
Severity: Low
Products: Firefox, Mozilla Suite
Description: A website can inject content into a popup opened by another site if the target name of the popup window is known. An attacker who knows you are going to visit that other site could spoof the contents of the popup.
Security Advisory 2005-14
Title: SSL "secure site" indicator spoofing
Severity: Moderate
Products: Firefox, Mozilla Suite
Description: Various schemes were reported that could cause the "secure site" lock icon to appear and show certificate details for the wrong site. These could be used by phishers to make their spoofs look more legitimate, particularly in windows that hide the address bar showing the true location.
Security Advisory 2005-15
Title: Heap overflow possible in UTF8 to Unicode conversion
Severity: High
Products: Firefox, Thunderbird, Mozilla Suite
Description: It is possible for a UTF8 string with invalid sequences to trigger a heap overflow of converted Unicode data. Exploitability would depend on the attackers ability to get the string into the buggy converter. General web content is converted elsewhere but we can't rule out the possibility of a successful attack.
Security Advisory 2005-16
Title: Spoofing download and security dialogs with overlapping windows
Severity: High
Products: Firefox, Mozilla Suite
Description: Michael Krax demonstrates that the download dialog and security dialogs can be spoofed by partially covering them with an overlapping window. Some users may not notice the OS window border and browser statusbar bisecting what appears to be a single dialog, and be convinced by the spoofing text of the top-most window to click on the "Allow" or "Open" button of the window below.
Security Advisory 2005-41
Title: Privilege escalation via DOM property overrides
Severity: Critical
Products: Firefox, Mozilla Suite
Description: moz_bug_r_a4 reported several exploits giving an attacker the ability to install malicious code or steal data, requiring only that the user do commonplace actions like click on a link or open the context menu. The common cause in each case was privileged UI code ("chrome") being overly trusting of DOM nodes from the content window.
Security Advisory 2006-76
Title: XSS using outer window's Function object
Impact: High
Products: Firefox 2.0
Description: moz_bug_r_a4 demonstrated that the Function prototype regression described in bug 355161 could be exploited to bypass the protections against cross site script (XSS) injection, which could be used to steal credentials or sensitive data from arbitrary sites or perform destructive actions on behalf of a logged-in user.
Vulnerabilities
Vulnerable components: 424 of 10,452 (4.05%)
What other components are vulnerable?
Is this new component likely to be vulnerable?
Vulture
[Diagram labels: Vulnerability Database, Version Archive, Code, Vulture, Component, Predictor]
Correlations
Programmer, Code Complexity, Language, Problem Domain, Imports
Imports
GUI, Database, Certificates, OS
Example (1)
nsIContent.h
nsIContentUtils.h
nsIScriptSecurityManager.h
Components with these imports: 21 marked ✘, 1 marked ✔ (95.5%)
Example (2)
nsIPrivateDOMEvent.h
nsReadableUtils.h
Components with these imports: 19 marked ✘ (100%)
Research Questions
• How well do imports predict vulnerabilities?
• Can imports be used for
  − classification (vulnerable or not) and for
  − regression (number of vulnerabilities)?
Input Data
Component            Vulnerability-related bug reports
nsCOMArray           0
nsIDocument.h        1
nspr_md.h            0
nsDOMClassInfo       10
EmbedGTKTools        0
MozillaControl.cpp   0
nsDOMClassInfo has had 10 vulnerability-related bug reports
Import matrix (excerpt; 9,059 more)
Imports shown: stdio.h, util.h, nsStackFrame.h, sys/file.h, ssImpl.h, nsIXPConnect.h, btree.h
1 0 0 0 1 0 0
0 0 1 0 0 1 0
0 1 1 0 0 1 0
0 0 1 0 1 0 0
0 0 0 0 1 0 0
0 1 0 1 0 0 0
nsDOMClassInfo imports “nsIXPConnect.h”
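A binary import matrix of this kind could be built as in the sketch below. The function name `build_import_matrix` and the import sets are hypothetical, apart from the one fact stated on the slide that nsDOMClassInfo imports nsIXPConnect.h.

```python
def build_import_matrix(imports_by_component):
    """Turn {component: set of imported headers} into a 0/1 matrix.

    Returns (columns, rows): columns is the sorted list of all headers,
    rows maps each component to a list with 1 where the header is imported.
    """
    columns = sorted({h for headers in imports_by_component.values() for h in headers})
    rows = {
        component: [1 if h in headers else 0 for h in columns]
        for component, headers in imports_by_component.items()
    }
    return columns, rows


# Hypothetical import sets, except nsDOMClassInfo -> nsIXPConnect.h (from the slide).
imports_by_component = {
    "nsDOMClassInfo": {"nsIXPConnect.h", "stdio.h"},
    "EmbedGTKTools": {"stdio.h"},
}
columns, rows = build_import_matrix(imports_by_component)
print(columns)
for component, row in rows.items():
    print(component, row)
```

Each row then serves as the feature vector for one component, with the number of vulnerability-related bug reports as the value to predict.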
Distribution
[Figure: Distribution of MFSAs. x-axis: Number of MFSAs; y-axis: Number of Components]
[Figure: Distribution of Bug Reports. x-axis: Number of Bug Reports; y-axis: Number of Components]
Experiments
• 40 random splits: 6,968 rows in training set, 3,484 rows in validation set
• Classification: train SVM, compute recall and precision
• Regression: train SVM, compute rank correlation on top 1%
• SVM: linear kernel with default parameters; R implementation (up to 10GB of main memory)
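The split and the classification metrics can be sketched in a few lines. The talk used an SVM implementation in R; the Python below does not train a model. It only illustrates, with hypothetical function names and toy labels, how a two-thirds/one-third random split and precision and recall could be computed.

```python
import random


def random_split(rows, train_fraction=2 / 3, seed=None):
    """Shuffle rows and split them into (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set of vulnerable components."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# 10,452 components split as on the slide.
train, validation = random_split(range(10452), seed=0)
print(len(train), len(validation))  # 6968 3484

# Toy example: two components predicted vulnerable, three actually vulnerable.
precision, recall = precision_recall({"a", "b"}, {"b", "c", "d"})
print(precision, recall)
```

Repeating the split with 40 different seeds and collecting the metric pairs would give one point per split.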
Results
[Figure (a): Precision and Recall. x-axis: Recall; y-axis: Precision]
[Figure (b): Rank Correlation. x-axis: Rank Correlation; y-axis: Cumulative Distribution]
2/3 of all vulnerable components detected
45% (about 1/2) of predictions correct
moderately strong correlation (mostly significant at p < 0.01)
Ranking
Rank  Component             Actual Rank
1     nsDOMClassInfo        3
2     SGridRowLayout        95
3     xpcprivate            6
4     jsxml                 2
5     nsGenericHTMLElement  8
6     jsgc                  3
7     nsISEnvironment       12
8     jsfun                 1
9     nsHTMLLabelElement    18
10    nsHttpTransaction     35
... (3,474 components)
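How closely a predicted ranking matches the actual one can be measured with Spearman's rank correlation. The sketch below is a plain-Python version that averages tied ranks; the function names are hypothetical. It is applied only to the ten rows shown on the slide, so the value it prints is not the figure from the talk, which evaluated the top 1% across 40 splits.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


predicted = list(range(1, 11))                # predicted ranks 1..10
actual = [3, 95, 6, 2, 8, 3, 12, 1, 18, 35]   # "Actual Rank" column
rho = spearman(predicted, actual)
print(round(rho, 3))
```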
Similar Results for Bugs
Packages + Import relationships (ISESE 2006)
Precision: 66.7% Recall: 69.4%
Binaries + Dependencies (Internship @ Microsoft Research, 2006)
Precision: 64.4% Recall: 75.3%
Vulture: Predicting Security Vulnerabilities
(Work in Progress)
locates past + predicts new vulnerabilities
problem domain
?
Future Work
#1: Mining across Projects
• Complement source code search engines with mining techniques.
• Large-scale mining (144,000 SF projects)
#2: Developer Buddy
MOCKUP
eROSE BugCache Vulture
automatic
large-scale
tool-oriented
Empirical Software Engineering 2.0
Thanks! Questions?