statistical distribution of metrics

Post on 22-Nov-2014


DESCRIPTION

Presentation for the Seminar on Open Source Evolution 2013 http://informatique.umons.ac.be/genlog/SOS-Evol/SOS-Evol2013.html

TRANSCRIPT

Statistical distributions of software metrics: do they matter?

Israel Herraiz

Technical University of Madrid

israel.herraiz@upm.es

Grab these slides from

http://slideshare.net/herraiz/statistical-distributions-of-metrics

Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17

Outline

1 Some background

2 Statistical properties of software metrics

3 Evidence of impact on quality

4 Summary of findings and further work

1 Some background

A (not so) long time ago...

Statistical distribution of software metrics

Software size follows a double Pareto distribution

"Towards a theoretical model for software growth", MSR 2007

More recently

Not only size, but some OO metrics too (and some complexity metrics)

"On the Statistical Distribution of Object-Oriented System Properties", WETSoM 2012

OK, but what is that double Pareto thing?

[Figure: log-log plot of P[X > x] against SLOC, with three series: Data, Double Pareto, Lognormal.]
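The plot on this slide is a complementary cumulative distribution, P[X > x], drawn on log-log axes. A minimal sketch of how such a curve could be computed and compared against a lognormal follows. The function names, the sample file sizes, and the choice to estimate the lognormal from the mean and standard deviation of log(SLOC) are assumptions made for illustration, not the procedure used in the talk.

```python
import math
import numpy as np

def empirical_ccdf(values):
    """Return the sorted unique values x and the empirical P[X > x] for each."""
    unique_x, counts = np.unique(np.asarray(values, dtype=float), return_counts=True)
    total = counts.sum()
    # Observations strictly greater than each unique value.
    greater = total - np.cumsum(counts)
    return unique_x, greater / total

def lognormal_ccdf(x, mu, sigma):
    """P[X > x] for a lognormal whose logarithm has mean mu and std sigma."""
    return 0.5 * math.erfc((math.log(x) - mu) / (sigma * math.sqrt(2)))

# Hypothetical file sizes in SLOC (not data from the talk).
sloc = np.array([12, 30, 45, 45, 80, 120, 300, 950, 4000, 22000])
logs = np.log(sloc)
mu, sigma = logs.mean(), logs.std()

xs, ccdf = empirical_ccdf(sloc)
for x, p in zip(xs, ccdf):
    print(f"{x:8.0f}  data={p:.3f}  lognormal={lognormal_ccdf(x, mu, sigma):.3f}")
```

Plotting both columns on log-log axes is one way to see whether the largest files sit above the lognormal curve, which is the region where the slides report a power law.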


But does it matter?

Most of the files are on the lognormal side

[Figure: bar chart by language (C, C++, Java, Python, Lisp); y-axis: % Files, 0 to 35.]

But the power law minority matters a lot

[Figure: bar chart by language (C, C++, Java, Python, Lisp); y-axis: % SLOC, 0 to 40.]
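The two bar charts compare the share of files on each side of the threshold with the share of SLOC those files hold. A minimal sketch of that bookkeeping follows, assuming a list of per-file SLOC counts and an already known threshold. The name `tail_shares`, the use of `>=` for "above", and the sample numbers are illustrative assumptions.

```python
def tail_shares(sloc_per_file, x_min):
    """Return (share of files, share of SLOC) for files with SLOC >= x_min."""
    if not sloc_per_file:
        raise ValueError("need at least one file")
    tail = [s for s in sloc_per_file if s >= x_min]
    return len(tail) / len(sloc_per_file), sum(tail) / sum(sloc_per_file)

# Hypothetical project: 99 small files and one very large file.
sizes = [10] * 99 + [660]
file_share, sloc_share = tail_shares(sizes, x_min=500)
print(f"{file_share:.0%} of the files hold {sloc_share:.0%} of the SLOC")
```

With these made-up sizes, a single file out of one hundred carries 660 of the 1650 lines, which mirrors the kind of imbalance the charts describe.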

Large files have a large impact

Size estimation models

Some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software.

[Figure: four panels (C, C++, Java, Python), each plotting RE against SLOC.]

"On the distribution of source code file sizes", ICSOFT 2011

2 Statistical properties of software metrics

Parameters of the statistical distribution

Power law parameters: λ and xmin

Transition from lognormal to power law

[Figure: log-log plot of P[X > x] against SLOC, with three series: Data, Double Pareto, Lognormal.]
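The slides name the power law parameters but do not say how they were estimated. One common choice, sketched here as an assumption rather than as the author's method, is the continuous maximum likelihood estimate of the exponent for a density proportional to x^(-alpha) on x >= x_min, namely alpha = 1 + n / sum(ln(x_i / x_min)). Whether the λ on the slide follows this convention is not stated, and x_min is taken as given here instead of being estimated.

```python
import math

def power_law_exponent(values, x_min):
    """Continuous MLE of alpha for p(x) ~ x**(-alpha), using only x >= x_min."""
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal x_min; exponent undefined")
    return 1.0 + len(tail) / log_sum

# Hypothetical file sizes in SLOC; the threshold of 1000 is made up as well.
sloc = [40, 120, 300, 1200, 2500, 4000, 9000, 30000]
print(power_law_exponent(sloc, x_min=1000))
```

Values below the threshold are ignored on purpose: the estimate describes only the tail, which is the part of the curve that departs from the lognormal.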

3 Evidence of impact on quality

Probability of finding defects

We have seen that files above xmin account for 40% of total size, while being only about 1% of the files.

What about defects? Probability of finding defects in three software projects (using CYCLO as metric)

Project       Below xmin   Above xmin
Apache        .4178        .7708
OpenIntents   .2500        .7500
Zxing         .2143        .4161

* Data extracted from "ReLink: Recovering Links between Bugs and Changes", FSE 2011.
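The table splits files at xmin and reports a probability of finding defects on each side. A minimal sketch of one way to compute such a number follows, assuming the probability is the fraction of files with at least one linked defect. That definition, the name `defect_probability`, and the sample records are assumptions; the slides do not spell out the calculation.

```python
def defect_probability(files, x_min):
    """Fraction of defective files below and above x_min.

    files: iterable of (metric_value, defect_count) pairs, one per file.
    """
    below = [defects for metric, defects in files if metric < x_min]
    above = [defects for metric, defects in files if metric >= x_min]

    def fraction_defective(group):
        if not group:
            return float("nan")  # no files on this side of the threshold
        return sum(1 for d in group if d > 0) / len(group)

    return fraction_defective(below), fraction_defective(above)

# Hypothetical (CYCLO, defects) records; not data from the cited study.
records = [(3, 0), (5, 1), (8, 0), (12, 0), (40, 2), (55, 0), (90, 1), (120, 4)]
p_below, p_above = defect_probability(records, x_min=30)
print(f"below: {p_below:.4f}  above: {p_above:.4f}")
```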

Probability of finding defects

Probability of finding defects (normalized metrics)

Using CYCLO / WMC as metric (cyclomatic complexity per LOC)

Project       Below xmin   Above xmin
Apache        .4159        .6296
OpenIntents   .2813        .5417
Zxing         .3181        .2389

Probability of finding defects

Defects density (only pre-release defects)

Using Number of Methods and number of pre-release defects per LOC

[Figure: two histograms, one panel for files below xmin and one for files above xmin.]

              Below xmin   Above xmin
Avg. Dens.    .2685        .4565

* Data obtained from "Predicting Defects for Eclipse", PROMISE 2007
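The average densities on this slide compare defects per LOC on each side of the threshold. A minimal sketch of that comparison follows, assuming the average is taken over per-file densities instead of pooling all defects and all lines. That choice, the name `average_defect_density`, and the sample records are assumptions made for illustration.

```python
from statistics import mean

def average_defect_density(files, x_min):
    """Mean per-file defect density (defects / LOC) below and above x_min.

    files: iterable of (metric_value, defects, loc) tuples, one per file.
    """
    below, above = [], []
    for metric, defects, loc in files:
        if loc <= 0:
            continue  # skip empty files to avoid dividing by zero
        (above if metric >= x_min else below).append(defects / loc)
    return (mean(below) if below else float("nan"),
            mean(above) if above else float("nan"))

# Hypothetical (number of methods, defects, LOC) records.
records = [(5, 1, 100), (8, 0, 50), (40, 6, 200), (60, 2, 100)]
d_below, d_above = average_defect_density(records, x_min=30)
print(f"below: {d_below:.4f}  above: {d_above:.4f}")
```

Averaging per file gives every file the same weight; pooling defects and lines first would weight large files more heavily, so the two conventions can disagree.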

Probability of finding defects

Defects density (only post-release defects)

Using Number of Methods and number of post-release defects per LOC

[Figure: two histograms, one panel for files below xmin and one for files above xmin.]

              Below xmin   Above xmin
Avg. Dens.    .1437        .2690

Probability of finding defects

Defects density (pre + post-release defects)

Using CYCLO/SLOC and number of total defects per LOC

[Figure: log-log plots of Pr(X ≥ x) against x.]

              Below xmin   Above xmin
Avg. Dens.    .3335        .7747
Files         >9000        364

4 Summary of findings and further work

Summary and further work

Summary of preliminary findings

Some metrics have a transition from lognormal to power law

Clear relation between normalized metrics and defects density

Although the threshold might not be perfect (e.g., you might find a high defects density in a file below the threshold), it greatly reduces the search space for potentially problematic files

Further work

Verify in more projects

Do you have defects data at the file level?

Find explanation for the transition and its influence on quality

How do the statistical parameters change over time? Do defects evolve accordingly?
