Post on 27-Jun-2015

DESCRIPTION

Job application talk.

TRANSCRIPT

Mining Software Archives to Support Software Development

Tom Zimmermann • Saarland University

Hello Calgary!

Software Development

Build

Collaboration

Comm. Archive

Bug Database

Version Archive

Mining Software Archives

eROSE BugCache Vulture

eROSE: Related Changes

(ICSE 2004, TSE 2005)

Tom Zimmermann • Saarland University
Peter Weißgerber • University of Trier
Stephan Diehl • University of Trier
Andreas Zeller • Saarland University

eROSE: Guiding Developers

Purchase History

Customers who bought this item also bought...

Version Archive

Developers who changed this function also changed...

eROSE suggests further locations.

eROSE prevents incomplete changes.

Processing CVS data

1. Comparing files
2. Building transactions

Comparing Files

First revision: A(), C(), E(), D(), B()

Second revision: A(), B(), E(), F(), D()
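The file comparison step can be sketched as a set comparison over named entities. This is only an illustration, assuming the first function list on the slide is the earlier revision and the second is the later one; the function bodies (`"a1"`, `"b2"`, and so on) and the helper name `compare_entities` are hypothetical.

```python
def compare_entities(old, new):
    """Classify entities between two revisions of one file.

    old, new: dict mapping an entity name to the source text of that entity.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # An entity counts as changed when it exists in both revisions
    # but its text differs.
    changed = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return {"added": added, "removed": removed, "changed": changed}


# Entity names are taken from the slide; the bodies are made up.
old_rev = {"A()": "a1", "C()": "c1", "E()": "e1", "D()": "d1", "B()": "b1"}
new_rev = {"A()": "a1", "B()": "b2", "E()": "e1", "F()": "f1", "D()": "d1"}

result = compare_entities(old_rev, new_rev)
print(result)
```

Comparing by name ignores the order in which entities appear, so a function that merely moved within the file is not reported as changed.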

Building Transactions

CVS

150,000

createGeneralPage()
createTextComparePage()
fKeys[]
initDefaults()
buildnotes_compare.html
PatchMessages.properties
plugin.properties

2003-02-19 (aweinand): fixed #13332

same author + message + time
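A minimal sketch of the "same author + message + time" grouping follows. The slide does not say how close in time two revisions must be, so the 200-second window is an assumption, as are the `Revision` record, the function name, and the timestamps in the example.

```python
from collections import namedtuple

# Hypothetical record for one per-entity CVS revision.
Revision = namedtuple("Revision", ["author", "message", "timestamp", "entity"])


def build_transactions(revisions, window=200):
    """Group revisions that share author and message and are close in time.

    window: maximum gap in seconds between consecutive revisions of one
    transaction (assumed value, not taken from the talk).
    """
    transactions = []
    latest = {}  # (author, message) -> most recent transaction for that key
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.message)
        tx = latest.get(key)
        if tx is None or rev.timestamp - tx["last"] > window:
            # Start a new transaction for this author and message.
            tx = {"last": rev.timestamp, "entities": []}
            latest[key] = tx
            transactions.append(tx)
        tx["entities"].append(rev.entity)
        tx["last"] = rev.timestamp
    return [tx["entities"] for tx in transactions]


revisions = [
    Revision("aweinand", "fixed #13332", 0, "fKeys[]"),
    Revision("aweinand", "fixed #13332", 5, "initDefaults()"),
    Revision("aweinand", "fixed #13332", 12, "plugin.properties"),
    Revision("aweinand", "fixed #13332", 5000, "fKeys[]"),
]
transactions = build_transactions(revisions)
print(transactions)
```

The last revision has the same author and message but falls outside the window, so it starts a second transaction.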

Mining Associations

User changes fKeys[] and initDefaults()

eROSE finds past transactions

#104223: fKeys[], initDefaults(), ..., plugin.properties
#756: fKeys[], initDefaults(), ..., plugin.properties
#6721: fKeys[], initDefaults(), ..., plugin.properties
#21078: fKeys[], initDefaults(), ..., plugin.properties
#42432: fKeys[], initDefaults(), ..., plugin.properties
#51345: fKeys[], initDefaults(), ..., plugin.properties
#59998: fKeys[], initDefaults(), ..., plugin.properties
#71003: fKeys[], initDefaults(), ..., plugin.properties
#87264: fKeys[], initDefaults(), ...
#91220: fKeys[], initDefaults(), ..., plugin.properties
#101823: fKeys[], initDefaults(), ..., plugin.properties

{fKeys[], initDefaults()} ⇒ {plugin.properties}

Support 10, Confidence 10/11 = 0.909
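The support and confidence figures can be reproduced with a few lines of code. This is a sketch under the usual association-rule definitions: support counts the transactions that contain both sides of the rule, and confidence divides that count by the number of transactions containing the left-hand side. The function name and the way the eleven transactions are modelled are assumptions.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions in which everything on the left-hand side was changed.
    matching = [t for t in transactions if antecedent <= set(t)]
    support = sum(1 for t in matching if consequent <= set(t))
    confidence = support / len(matching) if matching else 0.0
    return support, confidence


base = {"fKeys[]", "initDefaults()"}
# Eleven past transactions changed both entities; ten of them also
# changed plugin.properties.
past = [base | {"plugin.properties"}] * 10 + [base]

support, confidence = support_and_confidence(past, base, {"plugin.properties"})
print(support, round(confidence, 3))  # 10 0.909
```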

Evaluation

PostgreSQL, jEdit, KOffice, GIMP

eROSE predicts 33% of all changed entities (files: 44%).

In 70% of all transactions, eROSE’s topmost three suggestions contain a changed entity (files: 72%).

eROSE learns quickly (within 30 days).

eROSE: Related Changes

(ICSE 2004, TSE 2005)

non-program elements (documentation)

learns quickly

guides developers

BugCache: Predicting Defects

(ASE 2006, ICSE 2007)

Sung Kim • MIT
Tom Zimmermann • Saarland University
Jim Whitehead • Univ. of California SC
Andreas Zeller • Saarland University

The Problem

How should we allocate our resources for quality assurance?

One Solution

List with elements that (will) have defects

List is adaptive, i.e., it changes over time

Cache

The BugCache Model

Cache size: 2

Hypothesis: Temporal locality between defects

What is loaded in the cache?

First three defects: Miss, Hit, Miss

Hit rate = #Hits / #Defects = 1/3 = 33.3%

After the fourth defect: Miss, Hit, Miss, Miss
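The walkthrough can be replayed with a small simulation. This is a sketch only: the slides do not state a replacement policy, so least-recently-used eviction is an assumption, and the file names in the example are hypothetical. Only temporal locality is modelled, meaning an entity is loaded when a defect is found in it.

```python
from collections import OrderedDict


def simulate_cache(defect_sequence, cache_size=2):
    """Replay defects against a small cache and report hits and misses.

    defect_sequence: entities in the order in which defects were found.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    outcomes = []
    for entity in defect_sequence:
        if entity in cache:
            outcomes.append("Hit")
            cache.move_to_end(entity)
        else:
            outcomes.append("Miss")
            cache[entity] = True  # temporal locality: load the faulty entity
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    hit_rate = outcomes.count("Hit") / len(outcomes) if outcomes else 0.0
    return outcomes, hit_rate


outcomes, hit_rate = simulate_cache(["a.c", "a.c", "b.c"])
print(outcomes, f"{hit_rate:.1%}")  # ['Miss', 'Hit', 'Miss'] 33.3%
```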

Loading Elements

Temporal locality – as shown before

Spatial locality – load “nearby” elements (i.e., co-changed before)

Changed-entity locality – load changed elements

New-entity locality – load new elements

Initial pre-fetch – start with a loaded cache

Evaluation

PostgreSQL, jEdit, Mozilla, Columba

Hit Rates

                Methods               Files
Project         BugCache  FixCache    BugCache  FixCache
Apache 1.3      59.6%     61.5%       83.9%     81.5%
Columba         58.9%     67.6%       83.5%     83.0%
Eclipse         64.5%     71.6%       95.1%     95.0%
JEdit           50.5%     48.9%       85.7%     85.4%
Mozilla         49.3%     55.0%       93.3%     88.0%
PostgreSQL      61.9%     59.2%       73.9%     71.0%
Subversion      68.3%     43.8%       82.0%     81.3%

Cache size = 10%

Reasons for Hits

Spatial locality: 18%

Temporal locality: 60%

Initial pre-fetch: 18%

Legend: Initial pre-fetch, Temporal locality, Spatial locality, Changed-entity locality, New-entity locality

Warning Developers

“Safe” Location (not in FixCache)

Risky Location (red, in FixCache)

BugCache: Predicting Defects

(ASE 2006, ICSE 2007)

adaptive

hit rates of 71% to 95%

temporal locality

Vulture: Predicting Security Vulnerabilities

(Work in Progress)

Stephan Neuhaus • Saarland University
Tom Zimmermann • Saarland University
Andreas Zeller • Saarland University

Firefox/Mozilla

14,368 C/C++ files (10,452 components)

1,012,512 revisions

228,365 commits

>700 developers

Vulnerabilities

Security Advisory 2005-12
Title: Livefeed bookmarks can steal cookies
Impact: High
Products: Firefox
Description: Earlier versions of Firefox allowed javascript: and data: URLs as Livefeed bookmarks. When they updated the URL would be run in the context of the current page and could be used to steal cookies or data displayed on the page. If the user were on a page with elevated privileges (for example, about:config) when the Livefeed was updated, the feed URL could potentially run arbitrary code on the user's machine.

Security Advisory 2005-13
Title: Window Injection Spoofing
Severity: Low
Products: Firefox, Mozilla Suite
Description: A website can inject content into a popup opened by another site if the target name of the popup window is known. An attacker who knows you are going to visit that other site could spoof the contents of the popup.

Security Advisory 2005-14
Title: SSL "secure site" indicator spoofing
Severity: Moderate
Products: Firefox, Mozilla Suite
Description: Various schemes were reported that could cause the "secure site" lock icon to appear and show certificate details for the wrong site. These could be used by phishers to make their spoofs look more legitimate, particularly in windows that hide the address bar showing the true location.

Security Advisory 2005-15
Title: Heap overflow possible in UTF8 to Unicode conversion
Severity: High
Products: Firefox, Thunderbird, Mozilla Suite
Description: It is possible for a UTF8 string with invalid sequences to trigger a heap overflow of converted Unicode data. Exploitability would depend on the attackers ability to get the string into the buggy converter. General web content is converted elsewhere but we can't rule out the possibility of a successful attack.

Security Advisory 2005-16
Title: Spoofing download and security dialogs with overlapping windows
Severity: High
Products: Firefox, Mozilla Suite
Description: Michael Krax demonstrates that the download dialog and security dialogs can be spoofed by partially covering them with an overlapping window. Some users may not notice the OS window border and browser statusbar bisecting what appears to be a single dialog, and be convinced by the spoofing text of the top-most window to click on the "Allow" or "Open" button of the window below.

Security Advisory 2005-41
Title: Privilege escalation via DOM property overrides
Severity: Critical
Products: Firefox, Mozilla Suite
Description: moz_bug_r_a4 reported several exploits giving an attacker the ability to install malicious code or steal data, requiring only that the user do commonplace actions like click on a link or open the context menu. The common cause in each case was privileged UI code ("chrome") being overly trusting of DOM nodes from the content window.

Security Advisory 2006-76
Title: XSS using outer window's Function object
Impact: High
Products: Firefox 2.0
Description: moz_bug_r_a4 demonstrated that the Function prototype regression described in bug 355161 could be exploited to bypass the protections against cross site script (XSS) injection, which could be used to steal credentials or sensitive data from arbitrary sites or perform destructive actions on behalf of a logged-in user.

Vulnerable components: 424 of 10,452 (4.05%)

What other components are vulnerable?

Is this new component likely to be vulnerable?

Vulture

[Diagram, in build order: Vulnerability Database, Version Archive, Code; Vulture; Components; Predictor; Code]

Redo diagram

Correlations

Programmer

Code Complexity

Language

Problem Domain

Imports

GUI, Database, Certificates, OS

Example (1)

import nsIContent.h, nsIContentUtils.h, nsIScriptSecurityManager.h

✘ ✔

95.5%

Example (2)

import nsIPrivateDOMEvent.h, nsReadableUtils.h

✘ ✘

100%

Research Questions

• How well do imports predict vulnerabilities?

• Can imports be used for
  − classification (vulnerable or not) and for
  − regression (number of vulnerabilities)?

Input Data

Component            Vulnerability-related bug reports
nsCOMArray           0
nsIDocument.h        1
nspr_md.h            0
nsDOMClassInfo       10
EmbedGTKTools        0
MozillaControl.cpp   0

nsDOMClassInfo has had 10 vulnerability-related bug reports

Imports (columns): stdio.h, util.h, nsStackFrame.h, sys/file.h, ssImpl.h, nsIXPConnect.h, btree.h, ... (9,059 more)

1 0 0 0 1 0 0
0 0 1 0 0 1 0
0 1 1 0 0 1 0
0 0 1 0 1 0 0
0 0 0 0 1 0 0
0 1 0 1 0 0 0

nsDOMClassInfo imports “nsIXPConnect.h”
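The input data can be pictured as a binary import matrix plus a vector of vulnerability counts. The sketch below builds both from plain dictionaries. Only the facts that nsDOMClassInfo imports nsIXPConnect.h and has 10 vulnerability-related bug reports come from the slides; the other import sets and the function name `build_inputs` are hypothetical.

```python
def build_inputs(imports_by_component, vulnerability_counts):
    """Build a 0/1 import matrix and a matching vector of counts.

    imports_by_component: dict mapping component name -> set of imported headers.
    vulnerability_counts: dict mapping component name -> number of reports.
    """
    components = sorted(imports_by_component)
    headers = sorted({h for hs in imports_by_component.values() for h in hs})
    # One row per component, one column per header; 1 means "imports it".
    matrix = [
        [1 if h in imports_by_component[c] else 0 for h in headers]
        for c in components
    ]
    counts = [vulnerability_counts.get(c, 0) for c in components]
    return components, headers, matrix, counts


imports = {
    "nsDOMClassInfo": {"nsIXPConnect.h", "stdio.h"},  # stdio.h is made up here
    "nsCOMArray": {"stdio.h"},                        # made-up import set
}
reports = {"nsDOMClassInfo": 10, "nsCOMArray": 0}

components, headers, matrix, counts = build_inputs(imports, reports)
print(components, headers, matrix, counts)
```

For classification the count vector would be reduced to vulnerable or not; for regression it is used as is.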

Distribution

[Figure: Distribution of MFSAs. x-axis: Number of MFSAs; y-axis: Number of Components.]

[Figure: Distribution of Bug Reports. x-axis: Number of Bug Reports; y-axis: Number of Components.]

Experiments

• 40 random splits: 6,968 rows in training set, 3,484 rows in validation set

• Classification: train SVM, compute recall and precision

• Regression: train SVM, compute rank correlation on top 1%

• SVM: linear kernel with default parameters; R implementation (up to 10GB of main memory)
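The evaluation protocol around the classifier can be sketched without the SVM itself. The code below shows one random two-thirds/one-third split, which reproduces the 6,968 and 3,484 row counts, and the precision and recall computation. The helper names, the seed, and the toy predictions are assumptions; training the SVM is deliberately left out.

```python
import random


def random_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle the rows and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def precision_recall(predicted, actual):
    """Precision and recall for boolean predictions against boolean truth."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# One of the 40 splits: 10,452 components in total.
train, validation = random_split(range(10452))
print(len(train), len(validation))  # 6968 3484

# Toy predictions, only to show the metric computation.
precision, recall = precision_recall(
    [True, True, False, False], [True, False, True, False]
)
print(precision, recall)  # 0.5 0.5
```

Repeating the split with 40 different seeds and averaging the two metrics would mirror the "40 random splits" bullet.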

Results

[Figure: (a) Precision and Recall. x-axis: Recall; y-axis: Precision. (b) Rank Correlation. x-axis: Rank Correlation; y-axis: Cumulative Distribution.]

2/3 of all vulnerable components detected

45% (about 1/2) of predictions correct

moderately strong correlation (mostly significant at p < 0.01)

Ranking

Rank  Component             Actual Rank
1     nsDOMClassInfo        3
2     SGridRowLayout        95
3     xpcprivate            6
4     jsxml                 2
5     nsGenericHTMLElement  8
6     jsgc                  3
7     nsISEnvironment       12
8     jsfun                 1
9     nsHTMLLabelElement    18
10    nsHttpTransaction     35
... (3,474 components)
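How closely a predicted ranking follows the actual one can be measured with a rank correlation. The slides do not name the coefficient, so the Spearman coefficient used below is an assumption, as are the function names. Ties receive average ranks, and the inputs must not be constant.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Predicted and actual ranks of the ten components listed on the slide.
predicted = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
actual = [3, 95, 6, 2, 8, 3, 12, 1, 18, 35]
rho = spearman(predicted, actual)
print(rho)
```

A value near 1 means the predicted order matches the actual order; a value near 0 means the two orders are unrelated.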

Similar Results for Bugs

Packages + Import relationships (ISESE 2006)

Precision: 66.7% Recall: 69.4%

Binaries + Dependencies (Internship @ Microsoft Research, 2006)

Precision: 64.4% Recall: 75.3%

Vulture: Predicting Security Vulnerabilities

(Work in Progress)

locates past + predicts new vulnerabilities

problem domain

Future Work

#1: Mining across Projects

• Complement source code search engines with mining techniques.

• Large-scale mining (144,000 SF projects)

#2: Developer Buddy

MOCKUP

eROSE BugCache Vulture

automatic

tool-oriented

large-scale

Empirical Software Engineering 2.0

Thanks! Questions?
