sthomas slides

Post on 12-Apr-2017


Modeling the Evolution of Topics in Source Code Histories

Stephen W. Thomas

Bram Adams

Ahmed E. Hassan

Dorothea Blostein

[2]

[3]

What have the Skype developers been interested in?

Microsoft manager

Figure: popularity (y-axis) over time (x-axis) of the topics “Linux Development” and “Audio Codecs”.

[4]

What are developers working on?

Option 1: Speak with every developer

Option 2: Use automated tool

Figure: popularity (y-axis) over time (x-axis) of the topics “Linux Development” and “Audio Codecs”.

[5]

Tool: Topic Evolution Models

Figure: topic popularity (y-axis) over time (x-axis: versions V1.0, V1.1, V1.2, V2.0, V4.0) for the topics “Linux”, “codec”, and “GUI”, among others.

Applied to Source Code Histories

[6]

Success in Other Domains

Email Archives

Conference Proceedings Newspaper Articles

[7]

Topic Evolution on Source Code

Topic Model

Mapping Topics Over Time

Background: The Hall Model
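A minimal sketch of the "Mapping Topics Over Time" step, assuming a topic model has already been fitted to every version of every file and has returned one membership vector per file. The function name `topic_popularity`, the file names, and the membership values are hypothetical. Popularity is taken here to be the mean membership over the files of a version, which is one plausible definition and not necessarily the one used in the talk.

```python
from collections import defaultdict

def topic_popularity(memberships):
    """Map (version, file_id) -> topic memberships to version -> mean membership per topic."""
    by_version = defaultdict(list)
    for (version, _file_id), theta in memberships.items():
        by_version[version].append(theta)

    popularity = {}
    for version, thetas in by_version.items():
        n_topics = len(thetas[0])
        # Average each topic's membership over all files present in this version.
        popularity[version] = [
            sum(theta[k] for theta in thetas) / len(thetas)
            for k in range(n_topics)
        ]
    return popularity

# Hypothetical memberships for the topics (XML, GUI, File I/O).
memberships = {
    ("V1.0", "a.java"): [0.5, 0.0, 0.5],
    ("V1.0", "b.java"): [0.5, 0.5, 0.0],
    ("V1.1", "a.java"): [0.5, 0.0, 0.5],
    ("V1.1", "b.java"): [0.0, 1.0, 0.0],
}
popularity = topic_popularity(memberships)
```

Plotting each topic's entry of `popularity` against the version gives one evolution curve per topic.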

[8]

Figure: versions V1.0, V1.1, V1.2, V1.3 of a project. Each file is labeled with the two concepts it contains (XML + File I/O, XML + GUI, or GUI + File I/O), and the same files recur in successive versions.

[9]

Figure: grid of file ID (rows) by version V1 to V5 (columns). Each file is labeled with the two concepts it contains (XML + File I/O, XML + GUI, or GUI + File I/O).

Expect:

Topic 1: XML

Topic 2: GUI

Topic 3: File I/O

Get:

Topic 1: XML + File I/O

Topic 2: XML + GUI

Topic 3: GUI + File I/O

Problem: Topics are muddled, not distinct

[10]

Figure: the file ID by version grid (V1 to V5), with two plots of popularity over the versions.

Expect: popularity curves for File I/O, XML, and GUI

Get: popularity curves for Topic 1, Topic 2, and Topic 3

Problem: Evolutions not sensitive or accurate

[11]

Problems due to duplication

Topics are muddled, not distinct

Evolutions are not accurate

Found in Source Code Histories

63% files don’t change

84% files don’t change

99.8% words don’t change

99.8% words don’t change

[12]

JHotDraw

Real-World Duplication
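The slides do not say how the duplication percentages were measured. As a minimal sketch, assuming each version is available as a mapping from file path to file text, the share of unchanged files between two versions could be computed as follows. The function name and the sample files are hypothetical.

```python
def unchanged_file_fraction(old_version, new_version):
    """Fraction of files in new_version whose text is identical in old_version.

    Each version is a dict mapping a file path to the file's full text.
    """
    if not new_version:
        return 0.0
    unchanged = sum(
        1 for path, text in new_version.items()
        if old_version.get(path) == text
    )
    return unchanged / len(new_version)

# Hypothetical two-version history: one file unchanged, one edited, two added.
v1 = {"Figure.java": "class Figure {}", "Tool.java": "class Tool {}"}
v2 = {
    "Figure.java": "class Figure {}",
    "Tool.java": "class Tool { int x; }",
    "Menu.java": "class Menu {}",
    "Draw.java": "class Draw {}",
}
fraction = unchanged_file_fraction(v1, v2)
```

A word-level figure would need the same comparison applied to the tokens of each file rather than to whole files.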

The Diff Model

[13]

Diff Step

Topic Model

Reconstruction

Mapping Topics Over Time

The Diff Model

[14]

Figure: versions V1.0, V1.1, V1.2, V1.3.

Version 5.3.7

    ...
    if (vacstmt->options & VACOPT_VACUUM)
    {
        PreventTransactionChain(isTopLevel, stmttype);
        in_outer_xact = false;
    }
    ...

Version 5.3.8

    ...
    // Don’t run VACUUM in user transition block!
    if (vacstmt->options & VACOPT_VACUUM)
    {
        PreventTransactionChain(isTopLevel, stmttype);
        in_inner_xact = false;
    }
    ...

Diff

Deleted lines

    in_outer_xact = false;

Added lines

    // Don’t run VACUUM in user transition block!
    in_inner_xact = false;

Diff Step
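A minimal sketch of a diff step using Python's standard `difflib`, assuming each version of a file is available as a list of lines. The function name `diff_step` is hypothetical, and the talk does not say which diff tool was actually used. The sample input mirrors the two code versions shown on the slide.

```python
import difflib

def diff_step(old_lines, new_lines):
    """Return (deleted_lines, added_lines) between two versions of a file."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    deleted, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" contributes to both sides: old lines out, new lines in.
        if tag in ("delete", "replace"):
            deleted.extend(old_lines[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return deleted, added

old = [
    "if (vacstmt->options & VACOPT_VACUUM)",
    "{",
    "    PreventTransactionChain(isTopLevel, stmttype);",
    "    in_outer_xact = false;",
    "}",
]
new = [
    "// Don't run VACUUM in user transition block!",
    "if (vacstmt->options & VACOPT_VACUUM)",
    "{",
    "    PreventTransactionChain(isTopLevel, stmttype);",
    "    in_inner_xact = false;",
    "}",
]
deleted, added = diff_step(old, new)
```

Only the deleted and added lines would then be passed to the topic model, so unchanged lines are no longer counted once per version.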

[15]

[16]

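Reconstructing Topic Memberships

Second Version = First Version - Deleted Lines + Added Lines

First Version (1000 lines), from the topic model: GUI (90%), XML (10%)

Deleted Lines (200 lines), from the topic model: GUI (100%), XML (0%)

Added Lines (150 lines), from the topic model: GUI (20%), XML (80%)

Second Version (950 lines), inferred: GUI (77%), XML (23%)

GUI: (1000 * 90%) - (200 * 100%) + (150 * 20%) = 730 lines, and 730 / 950 ≈ 77%

XML: (1000 * 10%) - (200 * 0%) + (150 * 80%) = 220 lines, and 220 / 950 ≈ 23%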
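The arithmetic on this slide can be sketched as a small function. It assumes, as the slide does, that a topic's weight in a document is proportional to the document's line count. The name `reconstruct_membership` and the argument layout are hypothetical.

```python
def reconstruct_membership(first, deleted, added):
    """Infer the second version's topic shares from first - deleted + added.

    Each argument is (line_count, {topic: share}), with shares summing to 1.
    Returns (line_count, {topic: share}) for the second version.
    """
    n_first, theta_first = first
    n_del, theta_del = deleted
    n_add, theta_add = added
    topics = set(theta_first) | set(theta_del) | set(theta_add)
    # Line-weighted amount of each topic remaining after applying the diff.
    lines = {
        t: n_first * theta_first.get(t, 0.0)
        - n_del * theta_del.get(t, 0.0)
        + n_add * theta_add.get(t, 0.0)
        for t in topics
    }
    total = sum(lines.values())
    if total <= 0:
        raise ValueError("second version has no lines left")
    return total, {t: lines[t] / total for t in topics}

# Numbers taken from the slide.
total, shares = reconstruct_membership(
    (1000, {"GUI": 0.90, "XML": 0.10}),
    (200, {"GUI": 1.00, "XML": 0.00}),
    (150, {"GUI": 0.20, "XML": 0.80}),
)
```

With the slide's numbers this gives 950 lines, with shares that round to GUI 77% and XML 23%.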

Case Studies

[17]

JHotDraw: Drawing Application Framework (Java)

13 releases (5.2.0 – 7.5.1), 613 files, 84K SLOC

PostgreSQL: Database Management System (C)

46 releases (7.0.0 – 8.3.5), 844 files, 501K SLOC

I bet the Diff model discovers topics that are more distinct!

[18]

[19]

Measuring Distinctness With KL-Divergence

Figure: word-probability distributions of four topics over the words xml, fopen, button, element, menu, fclose, attribute.

XML topic vs. GUI topic: high KL divergence, high distinctness

XML + File IO topic vs. XML + GUI topic: low KL divergence, low distinctness

[20]

Average Topic Distinctness

Figure: the topic list (Topic 1, Topic 2, …, Topic K) shown twice for the Hall topics and twice for the Diff topics.

+32% +38%

Diff makes more distinct topics
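A minimal sketch of a distinctness measure based on KL divergence, assuming each topic is a probability distribution over the same vocabulary with no zero entries. The slides do not give the exact formula, so averaging over all ordered pairs of topics is an assumption here, and the word probabilities below are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def average_distinctness(topics):
    """Mean KL divergence over all ordered pairs of different topics."""
    pairs = [
        (a, b)
        for i, a in enumerate(topics)
        for j, b in enumerate(topics)
        if i != j
    ]
    return sum(kl_divergence(a, b) for a, b in pairs) / len(pairs)

# Vocabulary order: xml, fopen, button, element, menu, fclose, attribute.
xml_topic = [0.30, 0.02, 0.02, 0.30, 0.02, 0.02, 0.32]
gui_topic = [0.02, 0.02, 0.45, 0.02, 0.45, 0.02, 0.02]
xml_io_topic = [0.20, 0.20, 0.02, 0.18, 0.02, 0.20, 0.18]
xml_gui_topic = [0.20, 0.02, 0.20, 0.18, 0.20, 0.02, 0.18]

distinct = average_distinctness([xml_topic, gui_topic])
muddled = average_distinctness([xml_io_topic, xml_gui_topic])
```

Here `distinct` comes out larger than `muddled`, matching the slide's point that topics sharing a concept such as XML have a lower KL divergence.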

[21]

JHotDraw

[22]

Topics are muddled, not distinct

Evolutions are not accurate

Diff makes more distinct topics

Problems due to duplication

Found in Source Code Histories

I bet the Diff model discovers more accurate topic evolutions

[23]

[24]

Measuring Accuracy

No oracle dataset

1. Create simulated scenario by hand: truth known

2. Manually investigate evolutions in JHotDraw and PSQL: truth learned

Simulated Project

Figure: versions v1 to v10. PSQL backend.access is copied from each version to the next (9 copies).

[25]

Simulated Scenario 1

Figure: timeline v1 to v10, annotated “3 files from PSQL timezone” and “timezone topic”.

[26]

Manual Investigation

[27]

Figure: plot labeled Topic 1.

1. Select change events

2. Validate against project documentation (commit logs, release notes, etc.)

Diff makes more accurate topics

[28]

+25% precision

Simulated Project

+33% precision

JHotDraw

+47% precision

+100% recall
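A minimal sketch of how precision and recall of detected change events could be computed once the true events are known, whether from a simulated scenario or from manual investigation. Representing an event as a (topic, version) pair is an assumption, and the events below are hypothetical and are not results from the talk.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected change events against the known truth.

    Both arguments are sets of events, for example (topic, version) pairs.
    """
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical events: two real changes, four detections (two of them correct).
actual = {("timezone", "v4"), ("timezone", "v8")}
detected = {
    ("timezone", "v4"),
    ("timezone", "v6"),
    ("timezone", "v8"),
    ("access", "v2"),
}
precision, recall = precision_recall(detected, actual)
```

In this example every real change is found (recall 1.0), but half of the detections are spurious (precision 0.5).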

[29]

Topics are muddled, not distinct

Evolutions are not accurate

Diff makes more distinct topics

Diff makes more accurate evolutions

Problems due to duplication

Found in Source Code Histories

[30]

Summary
