Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects

Erika Camargo and Ochimizu Koichiro

Japan Advanced Institute of Science and Technology

ESEM 2009



Contents

1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work


Abstract

Challenge: enable logistic regression (LR) models built on design-complexity metrics to predict fault-prone object-oriented classes across software projects.

First attempted solution: simple log data transformations.

[Figure: P(y = 1), the probability of a fault-prone class, plotted against X, a design-complexity metric.]
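As a minimal sketch of the relationship between X and P(y = 1), a univariate LR model maps a metric value to a probability through the logistic function. The function name and the coefficients below are hypothetical, chosen only to show the shape of the curve; they are not values from the study.

```python
import math

def fault_prone_probability(x, b0, b1):
    """P(y = 1 | x) for a univariate logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical intercept and slope, for illustration only.
B0, B1 = -3.0, 0.25

for metric_value in (0, 12, 24):
    p = fault_prone_probability(metric_value, B0, B1)
    print(f"X = {metric_value:2d}  ->  P(y = 1) = {p:.3f}")
```

With a positive slope, larger metric values give a higher estimated probability that a class is fault-prone.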


Background

• Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models

• Among these metrics are the Chidamber & Kemerer (CK) metrics

– The 80th and 20th percentiles of their distributions can be used to determine high and low values

– Their thresholds cannot be determined before their use and should be derived and used locally
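The percentile rule above can be sketched as follows, deriving the cut-offs locally from one project's own metric distribution. The helper name `local_thresholds` and the WMC values are made up for illustration and are not data from the study.

```python
import statistics

def local_thresholds(metric_values):
    """Return (low, high) cut-offs at the 20th and 80th percentiles."""
    # quantiles(n=5) yields the 20th, 40th, 60th and 80th percentile cut points.
    cuts = statistics.quantiles(metric_values, n=5, method="inclusive")
    return cuts[0], cuts[3]

# Made-up WMC values for a single project.
wmc = [1, 2, 2, 3, 4, 5, 7, 9, 12, 20, 35]

low, high = local_thresholds(wmc)
low_classes = [v for v in wmc if v <= low]    # classes with "low" WMC
high_classes = [v for v in wmc if v >= high]  # classes with "high" WMC
print(f"low threshold = {low}, high threshold = {high}")
```

Because the cut-offs come from the project's own data, the same metric value may count as high in one project and as ordinary in another.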


Problem Analysis

Can an LR model built with these kinds of metrics work effectively across different software projects?

[Figure: P(y = 1) against X = Number of Methods for a small-size and a large-size software project, with LEAST FAULTY and MOST FAULTY labels.]


Case Study

1. Data analysis of 7 different projects and application of simple log data transformations.

2. Construction of 3 univariate LR models using a large open-source project (1st release of the Mylyn system, with 638 Java classes).
– Independent variables: CK-CBO, CK-RFC, CK-WMC
– Dependent variable: Defects (from Bugzilla & CVS)

3. Testing of these models with 2 other, smaller projects (with 11 and 13 Java classes).
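Steps 2 and 3 can be sketched roughly as below: fit a univariate LR model on one project, then classify classes from another. This is only an illustration under stated assumptions. The slides do not say which tool was used to fit the models, so plain batch gradient descent stands in here; all data values and the names `fit_univariate_lr` and `classify` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_univariate_lr(xs, ys, lr=0.5, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by batch gradient descent on log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of log-loss w.r.t. the linear term
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def classify(x, b0, b1, cutoff=0.5):
    """Return 1 (fault-prone) if the estimated probability reaches the cutoff."""
    return 1 if sigmoid(b0 + b1 * x) >= cutoff else 0

# Synthetic training data standing in for the large project:
# one metric value per class, and 1 if the class had defects.
train_x = [0.0, 0.3, 0.5, 0.7, 0.8, 1.0, 1.1, 1.3, 1.5, 1.8]
train_y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_univariate_lr(train_x, train_y)

# Synthetic metric values standing in for classes of a smaller project.
test_x = [0.2, 1.6]
predictions = [classify(x, b0, b1) for x in test_x]
```

One model is fitted per metric, so repeating the fit with CBO, RFC and WMC values would give the three univariate models.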


Challenge

Outliers produce biased regression estimates and reduce the predictive power of regression models.

BNS: Banking system (2006) *
CRS: Cruise control system (2005) *
ECS: E-commerce system (2006) *
ELCS: Elevator control system (2003) *
FACS: Factory automation system (2005) *
GMF: Graphical Modeling Framework **
MYL: Mylyn system **

(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.

(**) Eclipse project

The RFC data of BNS is more spread out than the data of MYL.


Case Study

Solution: a simple data transformation using "Log10".

Example:

• There are fewer outliers
• The data spread is more uniform

LCBO = Log10(CBO + 1)
LTCBO = Log10(CBO + 1) + dm

where dm is the difference between the CBO medians of the Mylyn system and of the system whose data is being transformed.
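The two transformations can be sketched as follows. Two points are assumptions, since the slide does not spell them out: dm is taken as the reference project's median minus the target project's median, and both medians are computed on the log-transformed values. The CBO lists are made up, and `reference_cbo` merely stands in for the Mylyn data.

```python
import math
import statistics

def lcbo(cbo):
    """LCBO = Log10(CBO + 1); the +1 keeps CBO = 0 defined."""
    return math.log10(cbo + 1)

def ltcbo_values(target_cbo, reference_cbo):
    """Return (LTCBO values for the target project, dm)."""
    log_target = [lcbo(v) for v in target_cbo]
    log_reference = [lcbo(v) for v in reference_cbo]
    # Assumption: dm is the median difference on the log scale,
    # reference project minus project being transformed.
    dm = statistics.median(log_reference) - statistics.median(log_target)
    return [v + dm for v in log_target], dm

# Made-up CBO values, for illustration only.
reference_cbo = [0, 2, 4, 9, 15, 40]
target_cbo = [1, 3, 9, 99]

shifted, dm = ltcbo_values(target_cbo, reference_cbo)
print(f"dm = {dm:.3f}")
print("LTCBO:", [round(v, 3) for v in shifted])
```

Under these assumptions, the shift by dm aligns the median of the transformed project with the median of the project used to build the model.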


Results

Effects of the log data transformations:

• Elimination of a great number of outliers
• Overall goodness of fit of the 3 models is better
• Discrimination (Most Faulty/Least Faulty)
– All models discriminate well between the Most Faulty and Least Faulty classes of the Mylyn system
– What about using different projects?


Results

BANKING SYSTEM

Group               Model   Correct classification   Correct classification   Effect
                            (raw data)               (log-transformed data)
MF (6 classes)      CBO     2                        5                        ↑
                    RFC     5                        5                        =
                    WMC     6                        6                        =
LF (5 classes)      CBO     5                        5                        =
                    RFC     3                        3                        =
                    WMC     4                        4                        =
BOTH (11 classes)   CBO     7                        10                       ↑
                    RFC     8                        8                        =
                    WMC     10                       10                       =

MF: Most Faulty
LF: Least Faulty
Effect: ↑ more classes correctly classified, ↓ fewer, = unchanged


Results

E-COMMERCE SYSTEM

Group               Model   Correct classification   Correct classification   Effect
                            (raw data)               (log-transformed data)
MF (9 classes)      CBO     3                        7                        ↑
                    RFC     9                        8                        ↓
                    WMC     7                        6                        ↓
LF (4 classes)      CBO     4                        4                        =
                    RFC     0                        3                        ↑
                    WMC     0                        4                        ↑
BOTH (13 classes)   CBO     7                        11                       ↑
                    RFC     9                        11                       ↑
                    WMC     7                        10                       ↑

MF: Most Faulty
LF: Least Faulty
Effect: ↑ more classes correctly classified, ↓ fewer, = unchanged
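To compare the two test projects on a common scale, the BOTH rows of the banking and e-commerce tables can be turned into correct-classification rates. The counts below are copied from those rows; the dictionary layout and the function name are only an illustrative choice.

```python
# (correct with raw data, correct with log-transformed data), from the BOTH rows.
RESULTS = {
    "Banking system": {"total": 11, "CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)},
    "E-commerce system": {"total": 13, "CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)},
}

def classification_rates(results):
    """Map (system, model) to (raw rate, log-transformed rate)."""
    rates = {}
    for system, row in results.items():
        total = row["total"]
        for model in ("CBO", "RFC", "WMC"):
            raw, log_tx = row[model]
            rates[(system, model)] = (raw / total, log_tx / total)
    return rates

rates = classification_rates(RESULTS)
for (system, model), (raw_rate, log_rate) in rates.items():
    print(f"{system:18s} {model}: raw {raw_rate:.0%}, log-transformed {log_rate:.0%}")
```

Over all classes of each project, no model classifies fewer classes correctly after the transformation, although the MF rows of the e-commerce system show small losses for RFC and WMC.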


Conclusions and Future work

• CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects

• Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used in the construction of the model

• Future work: further data exploration and study of data transformations


Thank you!

Questions, comments …

contact: [email protected]
