seeing the forest for the trees, umons 2011

Seeing the forest for the trees Bogdan Vasilescu [email protected] http://www.win.tue.nl/bvasiles/ Software Engineering and Technology group Eindhoven University of Technology November 23, 2011

Upload: bogdan-vasilescu

Post on 17-Dec-2014

DESCRIPTION

Slides used during the talk at UMons in November 2011.

TRANSCRIPT

Page 1: Seeing the forest for the trees, UMons 2011

Seeing the forest for the trees

Bogdan Vasilescu
[email protected]
http://www.win.tue.nl/~bvasiles/

Software Engineering and Technology group
Eindhoven University of Technology

November 23, 2011

Page 2: Seeing the forest for the trees, UMons 2011

2/21

/ department of mathematics and computer science

Eindhoven

Page 4: Seeing the forest for the trees, UMons 2011

Computer Science @TU/e

- Section Model Driven Software Engineering (MDSE)
- Group Software Engineering and Technology (SET)

Mark van den Brand Alexander Serebrenik

Page 6: Seeing the forest for the trees, UMons 2011

Interested in . . .

- Software evolution
  - Aggregation of code metrics
  - Activity in open-source projects
- Computational geometry

Page 8: Seeing the forest for the trees, UMons 2011

Aggregation of software metrics

Maintaining a software system is like renovating a house.

Maintainability assessment precedes changing the software.

Metrics are often applied to measure maintainability.

But metrics are defined at a low level (method, class).

We need aggregation techniques.

Page 9: Seeing the forest for the trees, UMons 2011

Aggregation of software metrics

Page 10: Seeing the forest for the trees, UMons 2011

Traditional aggregation techniques

Standard summary statistics: mean, median, . . .

Red line – mean; blue line – median
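
A minimal sketch of why a single summary statistic can hide the shape of a skewed metric distribution. The SLOC values are hypothetical and not taken from any system in the talk.

```python
from statistics import mean, median

# Hypothetical SLOC-per-class values: many small classes, one very large one.
sloc = [12, 15, 20, 22, 25, 30, 35, 40, 60, 2500]

avg = mean(sloc)
mid = median(sloc)

# The mean is pulled far above the typical class by the single outlier,
# while the median does not react to how large that outlier is.
print(f"mean   = {avg:.1f}")
print(f"median = {mid:.1f}")
```

Neither number alone says that one class holds most of the code, which is the gap inequality indices are meant to fill.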

Page 11: Seeing the forest for the trees, UMons 2011

Recent trend: Inequality indices

Econometrics: measure/explain the inequality of income or wealth.

Software metrics and econometric variables have distributions with similar shapes.

[Figure: two histograms. "Source Lines of Code: freecol-0.9.4" (x-axis: SLOC per class; y-axis: Frequency) and "Household income in Ilocos, Philippines (1998)" (x-axis: Income; y-axis: Frequency).]

Page 12: Seeing the forest for the trees, UMons 2011

Degree of concentration of functionality

Lorenz curve for SLOC in Hibernate 3.6.0-beta4.

[Figure: Lorenz curve. x-axis: % Classes (0.0 to 1.0); y-axis: % SLOC (0.0 to 1.0).]

Measure inequality between:
- individuals (e.g., classes)
- groups (e.g., components)

Often desirable to assess the contribution of the inequality between the groups.
- Decomposable indices
- Root-cause analysis

Page 13: Seeing the forest for the trees, UMons 2011

Degree of concentration of functionality

Lorenz curve for SLOC in Hibernate 3.6.0-beta4.

[Figure: Lorenz curve with the areas A and B and the index I_Hoover marked.]

$I_{\mathrm{Gini}} = \frac{A}{A+B} = 2A$
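
A minimal sketch of how the Lorenz curve and the two indices on this slide could be computed. It assumes the usual reading of the figure: A is the area between the diagonal and the Lorenz curve, B is the area under the curve, so A + B = 1/2. The function names and the SLOC values are hypothetical.

```python
def lorenz_points(values):
    """Cumulative share of the total held by the smallest k values, k = 0..n."""
    xs = sorted(values)
    total = sum(xs)
    points, running = [(0.0, 0.0)], 0.0
    for k, x in enumerate(xs, start=1):
        running += x
        points.append((k / len(xs), running / total))
    return points


def gini(values):
    """Gini = A / (A + B) = 2A = 1 - 2B, with B the area under the Lorenz curve."""
    pts = lorenz_points(values)
    # The trapezoidal rule is exact for a piecewise-linear curve.
    area_b = sum((x1 - x0) * (y0 + y1) / 2
                 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area_b


def hoover(values):
    """Hoover: share of the total that would have to move to reach equality."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / (2 * sum(values))


sloc = [10, 20, 30, 40, 400]  # hypothetical SLOC per class
print(lorenz_points(sloc))
print(gini(sloc), hoover(sloc))
```

Both indices are 0 when all classes have the same size and approach 1 as the code concentrates in a single class.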

Page 15: Seeing the forest for the trees, UMons 2011

Traceability via decomposability

Which individuals (classes in package) contribute to 80% of the inequality (of SLOC)?

Which class contributes the most to the inequality?
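
A sketch of how a decomposable index can answer both questions, using the Theil index as the example. The index is a sum of per-class terms, and for grouped data it splits into a between-group and a within-group part. Package names, class names and SLOC values are hypothetical; this is an illustration, not the tooling used in the talk.

```python
from math import log


def theil(values):
    """Theil index: mean of (x/mu) * ln(x/mu); requires strictly positive values."""
    mu = sum(values) / len(values)
    return sum((x / mu) * log(x / mu) for x in values) / len(values)


def theil_contributions(named_values):
    """Per-item terms of the Theil sum; they add up to the index itself."""
    n = len(named_values)
    mu = sum(named_values.values()) / n
    return {name: (x / mu) * log(x / mu) / n for name, x in named_values.items()}


def theil_between_within(groups):
    """Split the Theil index of the pooled data into between- and within-group parts."""
    pooled = [x for g in groups.values() for x in g]
    total, mu = sum(pooled), sum(pooled) / len(pooled)
    between = within = 0.0
    for g in groups.values():
        share = sum(g) / total          # share of all SLOC held by this group
        mu_g = sum(g) / len(g)
        between += share * log(mu_g / mu)
        within += share * theil(g)
    return between, within


# Hypothetical SLOC per class, and hypothetical packages.
classes = {"A": 900, "B": 300, "C": 120, "D": 40}
packages = {"core": [900, 300, 120], "util": [40, 30, 25, 20]}

contrib = theil_contributions(classes)
largest = max(contrib, key=contrib.get)   # class contributing most to inequality
between, within = theil_between_within(packages)
print(largest, contrib)
print(between, within)
```

Sorting the per-class terms and accumulating them until a chosen share of the index is reached gives the set of classes behind, say, 80% of the inequality.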

Page 16: Seeing the forest for the trees, UMons 2011

Other properties of inequality indices

Symmetry

Inequality stays the same for any permutation of the population.
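
A small check of symmetry, sketched with the Gini index written as a mean absolute difference. The SLOC values are hypothetical.

```python
from itertools import permutations


def gini(values):
    """Gini index: sum of |a - b| over all ordered pairs, divided by 2 * n^2 * mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(abs(a - b) for a in values for b in values) / (2 * n * n * mu)


sloc = [10, 20, 30, 400]  # hypothetical SLOC per class

# Evaluate the index on every ordering of the same four values.
results = {round(gini(list(p)), 12) for p in permutations(sloc)}
print(results)
```

The set holds a single value: the index depends only on which values occur, not on the order in which the classes are listed.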

Page 19: Seeing the forest for the trees, UMons 2011

Other properties of inequality indices

Population principle

Inequality does not change if the population is replicated any number of times.
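
A sketch of the population principle, using the Theil index and the mean logarithmic deviation (MLD) in their textbook forms. The SLOC values are hypothetical.

```python
from math import log


def theil(values):
    """Theil index: mean of (x/mu) * ln(x/mu); requires strictly positive values."""
    mu = sum(values) / len(values)
    return sum((x / mu) * log(x / mu) for x in values) / len(values)


def mld(values):
    """Mean logarithmic deviation: mean of ln(mu / x); requires strictly positive values."""
    mu = sum(values) / len(values)
    return sum(log(mu / x) for x in values) / len(values)


sloc = [10, 20, 30, 400]   # hypothetical SLOC per class
replicated = sloc * 3      # the same population, copied three times

print(theil(sloc), theil(replicated))
print(mld(sloc), mld(replicated))
```

Both indices are averages of per-class terms relative to the mean, so copying the population changes neither the mean nor the average of the terms.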

Page 22: Seeing the forest for the trees, UMons 2011

Other properties of inequality indices

Transfers principle

A transfer from a rich man to a poor man (without reversing their position) should decrease inequality.

Page 25: Seeing the forest for the trees, UMons 2011

Other properties of inequality indices

Transfers principle

[Figure: example values 20, 36, 45 and 30, 36.]

A transfer from a rich man to a poor man (without reversing their position) should decrease inequality.
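
A sketch of the transfers principle on hypothetical values (not the numbers from the slide's figure). The transfer moves 20 from the largest value to the second largest, both of which lie above the mean. The Gini index drops, whereas the Hoover index does not react to a transfer between two values on the same side of the mean.

```python
def gini(values):
    """Gini index: sum of |a - b| over all ordered pairs, divided by 2 * n^2 * mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(abs(a - b) for a in values for b in values) / (2 * n * n * mu)


def hoover(values):
    """Hoover index: half the summed absolute deviation from the mean, over the total."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / (2 * sum(values))


before = [10, 20, 30, 100, 240]   # hypothetical values, mean 80
after = [10, 20, 30, 120, 220]    # 20 moved from 240 to 100, ranks unchanged

print(gini(before), gini(after))      # Gini decreases
print(hoover(before), hoover(after))  # Hoover stays the same
```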

Page 26: Seeing the forest for the trees, UMons 2011

Other properties of inequality indices

Scale invariance: Gini, Theil, Atkinson, Hoover

Inequality does not change if all values are multiplied by the same constant.
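
A sketch contrasting invariance under multiplication with invariance under addition. The Gini index stands in for the scale-invariant indices named on the slide; the Kolm index is written in its textbook form with a hypothetical parameter beta = 0.01. The SLOC values are hypothetical.

```python
from math import exp, log


def gini(values):
    """Gini index: sum of |a - b| over all ordered pairs, divided by 2 * n^2 * mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(abs(a - b) for a in values for b in values) / (2 * n * n * mu)


def kolm(values, beta=0.01):
    """Kolm index: (1/beta) * ln(mean of exp(beta * (mu - x)))."""
    n = len(values)
    mu = sum(values) / n
    return log(sum(exp(beta * (mu - x)) for x in values) / n) / beta


sloc = [10, 20, 30, 400]             # hypothetical SLOC per class
doubled = [2 * x for x in sloc]      # every value multiplied by 2
shifted = [x + 1000 for x in sloc]   # the same constant added to every value

print(gini(sloc), gini(doubled), gini(shifted))
print(kolm(sloc), kolm(doubled), kolm(shifted))
```

Gini is unchanged by doubling but shrinks after the shift; Kolm behaves the other way round, since it only looks at differences from the mean.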

Page 28: Seeing the forest for the trees, UMons 2011

Summary

Ineq. index     Sym.  Inv.  Dec.  Pop.  Tra.
I_Gini          ✓     ×     -     ✓     ✓
I_Theil         ✓     ×     ✓     ✓     ✓
I_MLD           ✓     ×     ✓     ✓     ✓
I_Hoover        ✓     ×     -     ✓     -
I_Atkinson^α    ✓     ×     ✓     ✓     ✓
I_Kolm^β        ✓     +     ✓     ✓     ✓

Inv.: × = invariant under multiplication by a constant; + = invariant under addition of a constant.

Problems include:
- Domain not always R^n.
- No distinction between all values equal but low, and all values equal but high.
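
A sketch of both problems, using the mean logarithmic deviation (which needs strictly positive values) and the Hoover index. The helper `try_index` and all metric values are hypothetical.

```python
from math import log


def mld(values):
    """Mean logarithmic deviation: mean of ln(mu / x); needs strictly positive x."""
    mu = sum(values) / len(values)
    return sum(log(mu / x) for x in values) / len(values)


def hoover(values):
    """Hoover index: half the summed absolute deviation from the mean, over the total."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / (2 * sum(values))


def try_index(index, values):
    """Return the index value, or None when the values fall outside its domain."""
    try:
        return index(values)
    except (ValueError, ZeroDivisionError):
        return None


# Problem 1: a metric value of 0 (or a negative one) is outside the domain of MLD.
print(try_index(mld, [0, 2, 4]))      # None
print(try_index(mld, [-1, 2, 5]))     # None
print(try_index(hoover, [0, 2, 4]))   # defined as long as the total is not zero

# Problem 2: "all low" and "all high" are indistinguishable.
print(hoover([1, 1, 1]), hoover([1000, 1000, 1000]))
```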

Page 30: Seeing the forest for the trees, UMons 2011

Our research

Page 31: Seeing the forest for the trees, UMons 2011

Which are redundant?

I_Gini, I_Theil, I_MLD, I_Atkinson, and I_Hoover always convey the same information.

[Figure: two panels, SLOC and DIT, each showing correlation coefficients on a -1.0 to 1.0 axis for pairs of indices.]

SLOC:
MLD-Hoo Gin-MLD The-MLD Gin-Hoo Atk-Hoo The-Hoo Gin-Atk MLD-Atk Gin-The The-Atk
(91%) (89%) (91%) (90%) (92%) (92%) (90%) (91%) (91%) (92%)

DIT:
MLD-Hoo Atk-Hoo Gin-MLD The-Hoo Gin-Atk Gin-Hoo Gin-The The-MLD The-Atk MLD-Atk
(85%) (87%) (87%) (88%) (88%) (89%) (88%) (88%) (88%) (89%)
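
A sketch of how redundancy between two indices could be checked: aggregate the same metric per package with both indices, then compare the two rankings with Kendall's tau. The package data are hypothetical, and tau is computed here in its simple tau-a form; this is not the analysis pipeline behind the figures.

```python
from itertools import combinations
from math import log


def gini(values):
    """Gini index: sum of |a - b| over all ordered pairs, divided by 2 * n^2 * mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(abs(a - b) for a in values for b in values) / (2 * n * n * mu)


def theil(values):
    """Theil index: mean of (x/mu) * ln(x/mu); requires strictly positive values."""
    mu = sum(values) / len(values)
    return sum((x / mu) * log(x / mu) for x in values) / len(values)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (prod > 0) - (prod < 0)
    return score / len(pairs)


# Hypothetical SLOC-per-class values for four packages.
packages = {
    "a": [10, 10, 10, 10],
    "b": [10, 20, 30, 40],
    "c": [5, 5, 5, 200],
    "d": [1, 2, 3, 500],
}

gini_per_pkg = [gini(v) for v in packages.values()]
theil_per_pkg = [theil(v) for v in packages.values()]
tau = kendall_tau(gini_per_pkg, theil_per_pkg)
print(tau)
```

A tau close to 1 means both indices rank the packages the same way, so reporting both adds little.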

Page 32: Seeing the forest for the trees, UMons 2011

Is the correlation meaningful?

Superlinear (e.g., I_Theil–I_Gini) and chaotic (e.g., I_Theil–I_Kolm) patterns can be observed in the scatter plots.

[Figure: scatter plot "compiere: Theil-Gini. Kendall: 0.94, p-val: 0.00". x-axis: Gini (SLOC); y-axis: Theil (SLOC).]

[Figure: scatter plot "compiere: Theil-Kolm. Kendall: 0.25, p-val: 0.01". x-axis: Kolm (SLOC); y-axis: Theil (SLOC).]

Page 33: Seeing the forest for the trees, UMons 2011

Does the aggregation level matter?

Changing the aggregation level to class level does not affect the correlation between various aggregation techniques as measured at package level.

[Figure: six panels, y-axis: Kendall correlation coefficient (-1.0 to 1.0). Panel titles, each appearing on two panels: "Kendall: Gini - Theil (SLOC) (100%)", "Kendall: Theil - Atkinson (SLOC) (100%)", "Kendall: Theil - MLD (SLOC) (100%)".]

Page 34: Seeing the forest for the trees, UMons 2011

Does system size matter?

System size does influence the correlation between aggregation techniques, e.g., I_Theil–I_Kolm increases with system size.

[Figure: "hibernate - Kendall(Theil(SLOC), Kolm(SLOC)) (86 releases)". x-axis: Hibernate releases from 0.8.1 to 3.6.0-beta4; y-axis: correlation coefficient Theil(SLOC) - Kolm(SLOC), 0.0 to 1.0.]

Page 35: Seeing the forest for the trees, UMons 2011

References

A. Serebrenik and M. G. J. van den Brand. Theil index for aggregation of software metrics values. In Int. Conf. on Software Maintenance, pages 1–9. IEEE, 2010.

B. Vasilescu. Analysis of advanced aggregation techniques for software metrics. Master’s thesis, Eindhoven, The Netherlands, July 2011.

B. Vasilescu, A. Serebrenik, and M. G. J. van den Brand. By no means: A study on aggregating software metrics. In 2nd International Workshop on Emerging Trends in Software Metrics, Honolulu, Hawaii, USA, 2011.

B. Vasilescu, A. Serebrenik, and M. G. J. van den Brand. You can’t control the unfamiliar: A study on the relations between aggregation techniques for software metrics. In Int. Conf. on Software Maintenance. IEEE, 2011.

Page 36: Seeing the forest for the trees, UMons 2011

Correlation

Linear correlation can be misleading.

[Figure: four scatter plots, x-axis 5 to 15, y-axis 4 to 12, with these panel titles:
Pea: 0.816; Ken: 0.963; Spe: 0.990
Pea: 0.816; Ken: 0.636; Spe: 0.818
Pea: 0.816; Ken: 0.563; Spe: 0.690
Pea: 0.816; Ken: 0.426; Spe: 0.5]

[Vas11, VSvdB11a, SvdB10, VSvdB11b]
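
A sketch of the point made by the four panels. The shared Pearson value of 0.816 matches the well-known Anscombe quartet, so the example assumes two of those data sets; the slide text itself does not name the data. Kendall's tau is computed in its simple tau-a form.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson's linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (prod > 0) - (prod < 0)
    return score / len(pairs)


# Anscombe's data sets I and III (assumed to correspond to two of the panels).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

for name, y in (("I", y1), ("III", y3)):
    print(name, round(pearson(x, y), 3), round(kendall_tau(x, y), 3))
```

The linear coefficient is practically the same for both sets, while the rank-based coefficient separates a noisy trend from an almost perfectly monotone one with a single outlier.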
