
Page 1

Nearest Neighbor Sampling for Better Defect Prediction

Gary D. Boetticher
Department of Software Engineering
University of Houston - Clear Lake
Houston, Texas, USA

Page 2

The Problem: Why is there not more ML in Software Engineering?

Human-based:                     62 to 86% [Jørgensen 2004]
Algorithmic / Machine Learning:   7 to 16%

Page 3

Key Idea

More ML in SE through a more defined experimental process.

Page 4

Agenda

– A better defined process for better (quality) prediction
– Experiments: Nearest Neighbor Sampling on PROMISE defect data sets
– Extending the approach
– Discussion
– Conclusions

Page 5

A Better Defined Process

– Emphasis of ML approaches
– Emphasis on measuring success:
  – PRED(X)
  – Accuracy
  – MARE

Prediction success depends upon the relationship between training and test data.

Page 6

PROMISE Defect Data (from NASA)

Project  Code  Description
CM1      C     NASA spacecraft instrument
KC1      C++   Storage management for receiving/processing ground data
KC2      C++   Science data processing; no software overlap with KC1
JM1      C     Real-time predictive ground system
PC1      C     Flight software for an earth-orbiting satellite

21 inputs:
– Size (SLOC, comments)
– Complexity (McCabe cyclomatic complexity)
– Vocabulary (Halstead operators, operands)

1 output: number of defects
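
To make the data shape concrete, here is a minimal loading sketch in Python; the CSV file name and the "defects" column name are hypothetical stand-ins for however a local copy of the PROMISE data is stored.

    import pandas as pd

    # Hypothetical local copy of a PROMISE data set; the file name and the
    # "defects" column name are assumptions about the local format.
    df = pd.read_csv("kc2.csv")

    X = df.drop(columns=["defects"]).to_numpy()   # the 21 static-code metrics
    y = df["defects"].to_numpy()                  # output: number of defects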

Page 7

Data Preprocessing

Project  Original Size  Size (no bad records, no dups)  0 Defects  1+ Defects  % Defects
CM1         498            441                             393          48       10.9%
JM1      10,885          8,911                           6,904       2,007       22.5%
KC1       2,109          1,211                             896         315       26.0%
KC2         522            374                             269         105       28.1%
PC1       1,109            953                             883          70        7.3%

The output is reduced to 2 classes: 0 defects vs. 1+ defects.
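
A minimal sketch of this preprocessing in Python with pandas, under the same file-format assumptions as the loading sketch above:

    import pandas as pd

    # Hypothetical local copy of a PROMISE data set (file name and the
    # "defects" column name are assumptions about the local format).
    df = pd.read_csv("jm1.csv")

    df = df.dropna()           # drop bad records (e.g. missing metric values)
    df = df.drop_duplicates()  # drop duplicate vectors

    # Reduce the output to two classes: 0 defects vs. 1+ defects.
    df["defective"] = (df["defects"] > 0).astype(int)
    df = df.drop(columns=["defects"])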

Page 8

Experiment 1

[Figure: the JM1 data set — 6,904 instances with 0 defects and 2,007 instances with 1+ defects (≈22% defective). The training set is drawn as 40% of the original data; the rest supplies a Nice Test set and a Nasty Test set.]

Page 9

Experiment 1 Continued

[Figure: the remaining vectors from the data set — those not drawn for training — are divided into the Nice Test set and the Nasty Test set.]
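
The slides do not spell out the Nice/Nasty rule at this point, but the match definition on Page 14 suggests a plausible reading: a test vector is "nice" when its nearest neighbor in the training set has the same class, and "nasty" otherwise. A sketch under that assumption:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import NearestNeighbors

    def nice_nasty_split(X, y, train_frac=0.40, seed=0):
        # Assumption: a "nice" test vector has a nearest training neighbor of
        # the same class; a "nasty" test vector does not.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_frac, random_state=seed, stratify=y)
        nn = NearestNeighbors(n_neighbors=1).fit(X_tr)
        idx = nn.kneighbors(X_te, return_distance=False).ravel()
        nice = y_tr[idx] == y_te
        return (X_tr, y_tr), (X_te[nice], y_te[nice]), (X_te[~nice], y_te[~nice])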

Page 10

Experiment 1 Continued

J48 and Naïve Bayes classifiers from WEKA.

200 trials (100 with Nice test data + 100 with Nasty test data):
20 Nice trials + 20 Nasty trials for each of
– CM1
– JM1
– KC1
– KC2
– PC1
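
The study's classifiers are WEKA's J48 and Naïve Bayes; a rough scikit-learn analogue of a single trial (DecisionTreeClassifier is a CART-based stand-in for J48, not a drop-in equivalent) could look like this, reusing the hypothetical nice_nasty_split sketch above:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score, confusion_matrix

    # One trial: train once, score on both test sets, and print the confusion
    # matrices summarized on Page 12. Note that scikit-learn orders matrix rows
    # and columns by label value (0 defects first), unlike the slide.
    (X_tr, y_tr), (X_nice, y_nice), (X_nasty, y_nasty) = nice_nasty_split(X, y)
    for name, clf in [("J48-like tree", DecisionTreeClassifier()),
                      ("Naive Bayes", GaussianNB())]:
        clf.fit(X_tr, y_tr)
        for label, X_te, y_te in [("nice", X_nice, y_nice),
                                  ("nasty", X_nasty, y_nasty)]:
            pred = clf.predict(X_te)
            print(name, label, accuracy_score(y_te, pred))
            print(confusion_matrix(y_te, pred))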

Page 11

Results: Accuracy

                  Nice Test Set         Nasty Test Set
Project           J48     Naïve Bayes   J48     Naïve Bayes
CM1               97.4%   88.3%          6.2%   37.4%
JM1               94.6%   94.8%         16.3%   17.7%
KC1               90.9%   87.5%         22.8%   30.9%
KC2               88.3%   94.1%         42.3%   36.0%
PC1               97.8%   91.9%         19.8%   35.8%
Overall Average   94.4%   93.6%         18.7%   21.2%

Page 12

Results: Average Confusion Matrix
(rows = actual class, columns = predicted class; order: 1+ defects, then 0 defects)

Average Nice results:
J48:          Naïve Bayes:
   2    3        3    2
  58 1021       68 1011

Average Nasty results:
J48:          Naïve Bayes:
  50  249       60  241
   2    7        3    5

Note the class distribution: the Nice test sets are dominated by 0-defect instances, the Nasty test sets by 1+ defect instances.

Page 13

Experiment 2: 60% Train, KNN=3

Neighbor Description   # of TRUEs   # of FALSEs   J48 Accuracy   Naïve Bayes Accuracy
PPP                    None         None          NA             NA
PPN                    0            354           88             90
PNP                    0            5             40             20
NPP                    None         None          NA             NA
PNN                    3            0             100            0
NPN                    13           0             31             100
NNP                    110          0             25             28
NNN                    None         None          NA             NA
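
The pattern notation is not defined on the slide; a plausible reading is that each test vector is grouped by the classes of its three nearest training neighbors, nearest first (P = defect-prone neighbor, N = defect-free), with the table reporting actual-class counts and per-group accuracy. A sketch of the grouping step under that reading:

    from collections import Counter
    from sklearn.neighbors import NearestNeighbors

    def neighbor_patterns(X_tr, y_tr, X_te, k=3):
        # Assumed notation: "P" = defect-prone training neighbor, "N" =
        # defect-free, ordered nearest first ("PPN" = two defective
        # neighbors, then one defect-free).
        nn = NearestNeighbors(n_neighbors=k).fit(X_tr)
        idx = nn.kneighbors(X_te, return_distance=False)
        return ["".join("P" if y_tr[j] else "N" for j in row) for row in idx]

    # e.g. Counter(neighbor_patterns(X_tr, y_tr, X_te)) tallies each pattern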

Page 14

Assessing Experiment Difficulty

Exp_Difficulty = 1 - Matches / Total_Test_Instances

Match = a test vector whose nearest neighbor in the training set belongs to the same class.

Experimental Difficulty = 1  →  hard experiment
Experimental Difficulty = 0  →  easy experiment
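
A minimal sketch of this measure, assuming numeric feature arrays and 0/1 class labels as in the earlier sketches:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def experiment_difficulty(X_tr, y_tr, X_te, y_te):
        # Exp_Difficulty = 1 - Matches / Total_Test_Instances; a match is a
        # test vector whose nearest training neighbor has the same class.
        nn = NearestNeighbors(n_neighbors=1).fit(X_tr)
        idx = nn.kneighbors(X_te, return_distance=False).ravel()
        return 1.0 - np.mean(y_tr[idx] == y_te)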

Page 15

Assessing Overall Data Difficulty

Overall_Data_Difficulty = 1 - Matches / Total_Data_Instances

Match = a data vector whose nearest neighbor among the other vectors in the data set belongs to the same class.

Overall Data Difficulty = 1  →  difficult data
Overall Data Difficulty = 0  →  easy data
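
The same idea applied leave-one-out style over the whole data set; a sketch under the same assumptions (duplicates already removed, so each point's nearest neighbor other than itself is well defined):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def overall_data_difficulty(X, y):
        # Ask for 2 neighbors: the first is the point itself (distance 0,
        # assuming duplicates were removed), the second is its true neighbor.
        nn = NearestNeighbors(n_neighbors=2).fit(X)
        idx = nn.kneighbors(X, return_distance=False)[:, 1]
        return 1.0 - np.mean(y[idx] == y)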

Page 16

Discussion: Anticipated Benefits

– A method for characterizing the difficulty of an experiment
– More realistic models
– Easy to implement
– Can be integrated into N-way cross-validation
– Can apply to various types of SE data sets:
  – Defect prediction
  – Effort estimation
– Can be extended beyond SE to other domains

Page 17

Discussion: Potential Problems

– More work needs to be done
– Agreement is needed on how to measure Experimental Difficulty
– Extra overhead
– Implicitly or explicitly data-starved domains

Page 18

Conclusions

How to get more ML in SE? Assess experiments and data for their difficulty.

Benefits:
– More credibility to the modeling process
– More reliable predictors
– More realistic models

Page 19

Acknowledgements

Thanks to the reviewers for their comments!

Page 20

References

1) M. Jørgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.