Statistical distributions of software metrics: do they matter?

Israel Herraiz

Technical University of Madrid

israel.herraiz@upm.es

Grab these slides from

http://slideshare.net/herraiz/statistical-distributions-of-metrics

Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17

Outline

1 Some background

2 Statistical properties of software metrics

3 Evidence of impact on quality

4 Summary of findings and further work

1 Some background

A (not so) long time ago...

Statistical distribution of software metrics

Software size follows a double Pareto distribution

Towards a theoretical model for software growth, MSR 2007

More recently

Not only size, but some OO metrics too (and some complexity metrics)

On the Statistical Distribution of Object-Oriented System Properties, WETSoM 2012

OK, but what is that double Pareto thing?

[Figure: distribution plot on a logarithmic axis (ticks 1, 100, 10000) with the labels "Double Pareto" and "Lognormal"]

But does it matter?

Most of the files are on the lognormal side

[Figure: one panel per language: C, C++, Java, Python, Lisp]

But the power law minority matters a lot

[Figure: percentages per language: C, C++, Java, Python, Lisp]

Large files have a large impact
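The impact of large files can be checked on any project by comparing the fraction of files above a threshold with the fraction of total size they hold. A minimal sketch, assuming per-file sizes are available as a list; the function name `tail_share`, the threshold and the sample data are hypothetical and do not come from the slides.

```python
def tail_share(sizes, xmin):
    """Return (fraction of files above xmin, fraction of total size they hold)."""
    if not sizes:
        raise ValueError("sizes must not be empty")
    tail = [s for s in sizes if s > xmin]
    return len(tail) / len(sizes), sum(tail) / sum(sizes)


# Made-up data: 99 small files of 100 lines and one file of 9,900 lines.
sizes = [100] * 99 + [9900]
file_frac, size_frac = tail_share(sizes, xmin=1000)
print(file_frac, size_frac)  # 0.01 0.5
```

In this toy data set, 1% of the files hold half of the total size.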

Size estimation models

Some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software.

[Figure: plot panels with logarithmic axes (ticks from 1000 to 50000); only the panel label "Python" is recoverable]

On the distribution of source code file sizes ICSOFT 2011

2 Statistical properties of software metrics

Parameters of the statistical distribution

Power law parameters: λ and xmin

Transition from lognormal to power law

[Figure: distribution plot on a logarithmic axis (ticks 1, 100, 10000) with the labels "Double Pareto" and "Lognormal"]
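One common way to estimate the exponent of a power law tail, once xmin is fixed, is the maximum-likelihood estimator for a continuous power law. The sketch below assumes that λ is the exponent of a density proportional to x^(-λ) for x ≥ xmin; the slides do not say how the parameters were fitted, so this is an illustration rather than the method used in the talk. The function name `fit_exponent` and the synthetic sample are hypothetical.

```python
import math
import random


def fit_exponent(values, xmin):
    """Maximum-likelihood exponent of a continuous power law tail.

    Assumes p(x) is proportional to x**(-lam) for x >= xmin.
    """
    tail = [v for v in values if v >= xmin]
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum <= 0:
        raise ValueError("need at least one value strictly above xmin")
    return 1.0 + len(tail) / log_sum


# Synthetic check: draw from a power law with a known exponent
# (inverse transform sampling) and try to recover it.
rng = random.Random(0)
true_lam, xmin = 2.5, 1000.0
sample = [
    xmin * (1.0 - rng.random()) ** (-1.0 / (true_lam - 1.0))
    for _ in range(20000)
]
estimate = fit_exponent(sample, xmin)
print(round(estimate, 2))
```

The estimator takes xmin as given. Choosing xmin itself needs a separate step, which is not shown here.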

3 Evidence of impact on quality

Probability of finding defects

We have seen that files above xmin account for 40% of total size, while being only about 1% of the files.

What about defects? Probability of finding defects in three software projects (using CYCLO as metric)

Project       Below xmin   Above xmin
Apache        .4178        .7708
OpenIntents   .2500        .7500
Zxing         .2143        .4161

* Data extracted from “ReLink: Recovering Links between Bugs and Changes” FSE
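A table like this one can be produced from file-level data with a simple split at xmin. The sketch below assumes that "probability of finding defects" means the fraction of files with at least one linked defect in each group; the slides do not define it explicitly. The function name `defect_probability` and the sample records are hypothetical.

```python
def defect_probability(files, xmin):
    """Fraction of defective files below and above xmin.

    files: iterable of (metric_value, has_defect) pairs.
    """
    below = [has_defect for metric, has_defect in files if metric < xmin]
    above = [has_defect for metric, has_defect in files if metric >= xmin]

    def fraction(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    return fraction(below), fraction(above)


# Made-up records: (cyclomatic complexity, file has at least one defect).
files = [
    (2, False), (3, False), (5, True), (7, False), (9, False),
    (40, True), (65, False),
]
p_below, p_above = defect_probability(files, xmin=20)
print(p_below, p_above)  # 0.2 0.5
```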

Probability of finding defects (normalized metrics)

Using CYCLO / WMC as metric (cyclomatic complexity per LOC)

Project       Below xmin   Above xmin
Apache        .4159        .6296
OpenIntents   .2813        .5417
Zxing         .3181        .2389

Defects density (only pre-release defects)

Using Number of Methods and number of pre-release defects per LOC

[Figure: histograms of defect density for files below and above xmin]

             Below xmin   Above xmin
Avg. Dens.   .2685        .4565

* Data obtained from “Predicting Defects for Eclipse” PROMISE 2007
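The average defect density per group can be sketched in the same way. This assumes that the density of a file is its number of defects divided by its LOC, and that the reported value is the mean of the per-file densities in each group. The slides do not spell out the averaging, so both points are assumptions. The function name `mean_defect_density` and the sample records are hypothetical.

```python
def mean_defect_density(files, xmin):
    """Mean per-file defect density (defects / LOC) below and above xmin.

    files: iterable of (metric_value, defects, loc) triples.
    """
    groups = {"below": [], "above": []}
    for metric, defects, loc in files:
        if loc <= 0:
            continue  # density is undefined for empty files
        key = "above" if metric >= xmin else "below"
        groups[key].append(defects / loc)
    return {
        key: (sum(values) / len(values) if values else float("nan"))
        for key, values in groups.items()
    }


# Made-up records: (number of methods, defects, LOC).
files = [(3, 1, 100), (5, 0, 50), (40, 2, 100), (60, 6, 200)]
density = mean_defect_density(files, xmin=20)
print(density)
```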

Defects density (only post-release defects)

Using Number of Methods and number of post-release defects per LOC

[Figure: histograms of defect density for files below and above xmin]

             Below xmin   Above xmin
Avg. Dens.   .1437        .2690

Defects density (pre + post-release defects)

Using CYCLO/SLOC and number of total defects per LOC

[Figure: defect density plots with logarithmic axes]

Avg. Dens. = .3335 (>9000 files)

Avg. Dens. = .7747 (364 files)

4 Summary of findings and further work

Summary and further work

Summary of preliminary findings

Some metrics have a transition from lognormal to power law

Clear relation between normalized metrics and defects density

Although the threshold might not be perfect (e.g., you might find a high defects density in a file on the lower side), it greatly reduces the search space for potentially problematic files

Further work

Verify in more projects

Do you have defects data at the file level?

Find explanation for the transition and its influence on quality

How do the statistical parameters change over time? Do defects evolve accordingly?
