
Automated Fault Prediction

The Ins, The Outs, The Ups, The Downs

Elaine Weyuker, June 11, 2015

To determine which files of a large software system with multiple releases are likely to contain the largest numbers of bugs in the next release.

Help testers prioritize testing efforts.

Help developers decide when to do design and code reviews and what to reimplement.

Help managers allocate resources.

Verified that bugs were non-uniformly distributed among files.

Identified properties that were likely to affect fault-proneness, and then built a statistical model and ultimately a tool to make predictions.

● Size of file (KLOCs)
● Number of changes to the file in the previous 2 releases.
● Number of bugs in the file in the last release.
● Age of file (number of releases in the system)
● Language the file is written in.

● All of the systems we’ve studied to date use a configuration management system which integrates version control and change management functionality, including bug history.

● Data is automatically extracted from the associated data repository and passed to the prediction engine.

Used Negative Binomial Regression

Also considered machine learning algorithms including:
◦ Recursive Partitioning
◦ Random Forests
◦ BART (Bayesian Additive Regression Trees)

● Consists of two parts.

● The back end extracts data needed to make the predictions.

● The front end makes the predictions and displays them.

Extracts necessary data from the repository.

Predicts how many bugs will be in each file in the next release of the system.

Sorts the files in decreasing order of the number of predicted bugs.

Displays results to user.

Percentage of actual bugs that occurred in the N% of the files predicted to have the largest number of bugs. (N=20)

Considered other measures less sensitive to the specific value of N.
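Sketched below, as a minimal illustration (not the tool's code), is how the top-N% measure can be computed from parallel lists of predicted and actual per-file fault counts; all names are hypothetical.

```python
# Hedged sketch of the top-N% measure: the share of actual faults
# falling in the N% of files with the highest predicted fault counts.
# Function and variable names are hypothetical.

def top_n_percent_faults(predicted, actual, n=20):
    """predicted, actual: parallel per-file fault counts."""
    k = max(1, len(predicted) * n // 100)            # files in the top N%
    ranking = sorted(range(len(predicted)),
                     key=lambda i: predicted[i], reverse=True)
    top = ranking[:k]                                # indices of top-N% files
    return 100.0 * sum(actual[i] for i in top) / sum(actual)

# A return value of 83.0 would mean the top 20% of files
# held 83% of the actual faults.
```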

System   Years Followed   Releases   LOC     % Faults in Top 20%
NP       4                17          538K   83%
WN       2                 9          438K   83%
VT       2.25              9          329K   75%
TS       9+               35          442K   81%
TW       9+               35          384K   93%
TE       7                27          327K   76%
IC       4                18         1520K   91%
AR       4                18          281K   87%
IN       4                18         2116K   93%

The Tool

[Architecture diagram: the Prediction Engine performs the statistical analysis; its inputs are the version mgmt / fault database (previous releases), the release to be predicted, and user-supplied parameters; its output is the fault-proneness predictions.]

User enters system name.

User asks for fault predictions for release “Bluestone2008.1”

Available releases are found in the version mgmt database. User chooses the releases to analyze.

User selects 4 file types.

User specifies that all problems reported in System Test phase are faults.

User confirms the configuration.

User enters filename to save the configuration.

User clicks the Save & Run button to start the prediction process.

Initial prediction view for Bluestone2008.1

All files are listed in decreasing order of predicted faults

Listing is restricted to eC files

Listing is restricted to 10% of eC files

Prediction tool is fully operational:
◦ 750 lines of Python for the interface
◦ 2150 lines of C, 75K bytes compiled, for the prediction engine

The current version's back end (written in C) is specific to the internal AT&T configuration management system, but it can be adapted to other configuration management systems. All that is needed is a source of the data required by the prediction model.

Variations of the Fault Prediction Model

Developers
◦ Counts
◦ Individuals

Amount of Code Change

Calling Structure

1. Standard model
2. Developer counts
3. Individual developers
4. Line-level change metrics
5. Calling structure

Overview

Underlying statistical model
◦ Negative binomial regression

Output (dependent) variable
◦ Predicted fault count in each file of release n

Predictor (independent) variables
◦ KLOC (n)
◦ Previous faults (n-1)
◦ Previous changes (n-1, n-2)
◦ File age (number of releases)
◦ File type (C, C++, java, sql, make, sh, perl, ...)

The Standard Model
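As a rough illustration of the standard model (not the tool's actual implementation), the negative binomial regression can be fit with an off-the-shelf library; the data file and column names below are hypothetical stand-ins for the predictors listed above.

```python
# Hedged sketch of the standard model via negative binomial regression
# (statsmodels). The data file and column names are hypothetical;
# prev_changes stands in for the change counts of releases n-1 and n-2.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

files = pd.read_csv("release_history.csv")   # one row per file per release

model = smf.glm(
    "faults ~ np.log(kloc) + prev_faults + prev_changes + age + C(file_type)",
    data=files,
    family=sm.families.NegativeBinomial(),
).fit()

# Predict fault counts for the upcoming release and rank files,
# most fault-prone first.
files["predicted"] = model.predict(files)
print(files.sort_values("predicted", ascending=False).head(20))
```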

How many different people have worked on the file in the most recent previous release?

How many different people have worked on the file in all previous releases? This is a cumulative count.

How many people who changed the file were working on it for the first time?

Developer counts
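The sketch below shows how these three counts might be mined from a change log; the `changes` table (columns file, release, developer) is a hypothetical stand-in for data extracted from the configuration management system.

```python
# Hedged sketch of the three developer counts for release n, from a
# hypothetical change log with columns: file, release, developer.
import pandas as pd

def developer_counts(changes: pd.DataFrame, n: int) -> pd.DataFrame:
    prev = changes[changes.release == n - 1]     # changes in release n-1
    past = changes[changes.release < n]          # all previous releases
    earlier = changes[changes.release < n - 1]   # releases before n-1

    # Developers who changed each file in the most recent previous release.
    recent = prev.groupby("file")["developer"].nunique()
    # Cumulative count of developers who ever changed each file.
    cumulative = past.groupby("file")["developer"].nunique()
    # Developers in release n-1 who touched the file for the first time.
    seen = earlier.groupby("file")["developer"].agg(set).to_dict()
    new = {f: len(set(d) - seen.get(f, set()))
           for f, d in prev.groupby("file")["developer"]}

    return pd.DataFrame({"recent": recent,
                         "cumulative": cumulative,
                         "new": pd.Series(new)}).fillna(0).astype(int)
```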

Faults per file in releases of System BTS

[Charts: prediction accuracy of the standard model compared with models adding each developer count: developers changing the file in the previous release, new developers changing the file in the previous release, and total developers changing the file in all previous releases.]

None of the developer count attributes uniformly increases prediction accuracy: every one of them, when added to the standard model, sometimes leads to less accurate predictions than the standard model alone, and the benefit is never major.

Summary

The standard model includes a count of the number of changes made in the previous two releases. It does not take into account how much code was changed.

We will now look at the impact on predictive accuracy of adding to the model fine-grained information about change size.

Code Change

Number of changes made to a file during a previous release

Number of lines added
Number of lines deleted
Number of lines modified
Relative size of change (line changes / LOC)

Changed/not changed

Measures of Code Change
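A sketch of computing these measures from mined change records follows; the table of per-change line counts and the per-file LOC series are hypothetical assumptions, not the tool's data format.

```python
# Hedged sketch of the change measures for release n, from a
# hypothetical table of change records (columns: file, release,
# adds, deletes, mods) and a per-file LOC series indexed by file.
import pandas as pd

def change_measures(deltas: pd.DataFrame, loc: pd.Series, n: int) -> pd.DataFrame:
    prev = deltas[deltas.release == n - 1]       # changes during release n-1
    m = prev.groupby("file").agg(
        changes=("release", "size"),             # count of changes
        adds=("adds", "sum"),
        deletes=("deletes", "sum"),
        mods=("mods", "sum"),
    )
    m["relative"] = (m.adds + m.deletes + m.mods) / loc   # relative churn
    m["changed"] = 1                             # changed/not-changed flag
    return m
```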

18 releases, 5-year lifespan

IC: Large provisioning system
◦ 6 languages: Java (60%), C, C++, SQL, SQL-C, SQL-C++
◦ 3000+ files
◦ 1.5M LOC
◦ Average of 395 faults/release

AR: Utility, data aggregation system
◦ >10 languages: Java (77%), Perl, xml, sh, ...
◦ 800 files
◦ 280K LOC
◦ Average of 90 faults/release

Two Subject Systems

Distribution of files, averages over all releases.

System IC Faults per File, by Release

System AR Faults per File, by Release

Univariate models

Base model: log(KLOC), File age, File type

Augmented models:
◦ Previous Changes
◦ Previous {Adds / Deletes / Mods}
◦ Previous {Adds / Deletes / Mods} / LOC (relative churn)
◦ Previous Developers

Prediction Models with Line-level Change Counts

Fault-percentile averages for univariate predictor models: System IC

Base Model and Added Variables: System IC

• Base model: KLOC, File age (number of releases), File type (C,C++,java,sql,make,sh,perl,...)

Base Model and Added Variables: System AR

Change information provides important information for fault predictions.

{Adds+Deletes+Mods} improves the accuracy of a model that doesn't include any change information, BUT a simple count of prior changes slightly outperforms {Adds+Deletes+Mods}.

Prior changed (a simple binary variable) is nearly as good as either, when added to a model without change info.

Lines added is the most effective single change predictor.

Lines deleted is the least effective single change predictor.

Relative change is no better than absolute change for predicting total fault count.

Summary

Individual Developers

How can we measure the effect that a single developer has on the faultiness of a file?

If developer d modifies k files in release N:
◦ how many of those files have bugs in release N+1?
◦ how many bugs are in those files in release N+1?

The BuggyFile Ratio

If d modifies k files in release N, and if b of them have bugs in release N+1, the buggyfile ratio for d is b/k
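In code, the definition might look like the following sketch, over hypothetical `changes` (file, release, developer) and `bugs` (file, release, faults) tables mined from the repository.

```python
# Hedged sketch of the buggyfile ratio b/k: of the k files developer d
# modified in release n, b have at least one bug in release n+1.
# The `changes` and `bugs` tables are hypothetical.
def buggyfile_ratio(changes, bugs, d, n):
    touched = set(changes[(changes.developer == d)
                          & (changes.release == n)].file)    # the k files
    buggy = set(bugs[(bugs.release == n + 1) & (bugs.faults > 0)].file)
    return len(touched & buggy) / len(touched) if touched else None  # b/k
```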

System IC has 107 programmers.

Over 15 releases, their buggyfile ratios vary between 0 and 1

The average is about 0.4

Average buggyfile ratio, all programmers

Buggyfile ratio for two programmers

Buggyfile ratio: more typical cases

The Bug Ratio

If d modifies k files in release N, and if there are B bugs in those files in release N+1, the bug ratio for d is B/k

The bug ratio can vary between 0 and B

Over 15 releases, we’ve seen a maximum bug ratio of about 8

The average is about 1.5
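A companion sketch for the bug ratio, over the same hypothetical tables as the buggyfile ratio above:

```python
# Hedged sketch of the bug ratio B/k: total bugs in release n+1 across
# the k files developer d modified in release n (hypothetical tables).
def bug_ratio(changes, bugs, d, n):
    touched = set(changes[(changes.developer == d)
                          & (changes.release == n)].file)    # the k files
    if not touched:
        return None
    b = bugs[(bugs.release == n + 1) & bugs.file.isin(touched)].faults.sum()
    return b / len(touched)                                  # B / k
```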

[Charts: bug ratio and buggyfile ratio across releases.]

Problems with these definitions

A file can be changed by more than one developer.

A file may be changed in Rel N and a fault detected in N+1, but that change may not have caused that fault.

A programmer might change many files in the identical trivial ways (interface, variable name, ...)

The “best” programmers might be assigned to work on the most difficult files.

For most programmers, the bug ratios vary widely from release to release.

• Is individual programmer bug-proneness helpful for prediction?

• Is this information useful for helping a project succeed?

• Are there better ways to measure it?

• Is it ethical to measure it?

• Does attempting to measure it lead to poor performance and unhappy programmers?

Some final thoughts

Are files that have a high rate of interaction with other files more fault-prone?

Calling Structure

[Diagram: methods 1 and 2 of File Q call Files X, Y, and Z (the callees of File Q); Files A and B call File Q (the callers of File Q).]

For each file:

number of callers & callees
number of new callers & callees
number of prior new callers & callees
number of prior changed callers & callees
number of prior faulty callers & callees
ratio of internal calls to total calls

Calling Structure Attributes Investigated
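A minimal sketch of the first of these attributes, counting callers and callees from a call graph given as (caller, callee) file edges; how the edges are extracted (e.g., by static analysis of the source) is assumed rather than shown.

```python
# Hedged sketch: per-file caller/callee counts from call-graph edges.
from collections import defaultdict

def call_counts(edges):
    callers, callees = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        callees[src].add(dst)    # src calls dst: dst is a callee of src
        callers[dst].add(src)    # ...and src is a caller of dst
    files = set(callers) | set(callees)
    return {f: (len(callers[f]), len(callees[f])) for f in files}

# The File Q example from the diagram: A and B call Q; Q calls X, Y, Z.
edges = [("A", "Q"), ("B", "Q"), ("Q", "X"), ("Q", "Y"), ("Q", "Z")]
print(call_counts(edges)["Q"])   # (2, 3): 2 callers, 3 callees
```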

Code and history attributes, no calling structure

Code and history attributes, including calling structure

Code attributes only, including calling structure

Fault Prediction by Multi-variable Models

Models applied to C, C++, and C-SQL files of one of the systems studied.

First model built from the single best attribute.

Each succeeding model built by adding the attribute that most improves the prediction.

Stop when no attribute improves.
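A sketch of this greedy forward selection; `fit_and_score` is a hypothetical callback that builds a model from the chosen attributes and returns its prediction accuracy (higher is better).

```python
# Hedged sketch of greedy forward selection: start from the single
# best attribute, keep adding the attribute that most improves the
# prediction, stop when no attribute improves.
def forward_select(attributes, fit_and_score):
    chosen, best = [], float("-inf")
    while True:
        candidates = [a for a in attributes if a not in chosen]
        if not candidates:
            break
        scores = {a: fit_and_score(chosen + [a]) for a in candidates}
        a, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:        # no remaining attribute improves
            break
        chosen.append(a)
        best = score
    return chosen
```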

Fault Prediction by Multi-variable Models

Code and history attributes, no calling structure

Code, history, and calling structure attributes

Code and calling structure attributes but not numbers of faults or changes in previous releases.

Calling structure attributes do not increase the accuracy of predictions.

History attributes (prior changes, prior faults) increase accuracy, either with or without calling structure.

We only studied these issues for two of the systems.

Summary

The Standard Model performs very well (on all nine industrial systems we have examined)

The augmented models add very little or no additional accuracy

Cumulative developers is the most effective addition to the Standard Model, but still doesn’t guarantee improved prediction or yield significant improvement.

Overall Summary

◦ Will our standard model make accurate predictions for open-source systems?

◦ Will our standard model make accurate predictions for agile systems?

◦ Can we predict which files will contain the faults with the highest severities?

◦ Can predictions be made for units smaller than files?

◦ Can run-time attributes be used to make fault predictions? (execution time, execution frequency, memory use, …)

◦ What is the most meaningful way to assess the effectiveness and accuracy of the predictions?

What’s Ahead?
