Bug prediction based on your code history

Alexey Tokar

VP of Engineering @ WorldAPP

Software Development Life Cycle

● Phase 1: Requirements analysis

● Phase 2: Design

● Phase 3: Development

● Phase 4: Testing

● Phase 5: Maintenance

Zoom in to quality control

● static code analyzers find non-conceptual issues

● automated tests cover predefined scenarios

● code reviews are aimed at sharing and enforcing best practices; less than 10% of the discussions uncover logical issues

● and, finally, QA has no idea which parts of a system could be affected by a code change, and neither does the programmer

20 bugs in a production environment per week

Let's try to guess common patterns

● a tired engineer makes more mistakes

● the more an engineer knows about a certain module, the fewer bugs (s)he will produce

● small changes have fewer bugs than long listings

● some parts of the system are more complicated than others, so the risk of introducing a bug there is higher

● huge changes made in a short period of time contain more bugs (done in a hurry)

What tools do we have across SDLC?

● ticket types

● action history

● exact code changes

● author of modifications

● class complexity

● code metrics

Hypothesis

If we know that a certain commit fixed a bug, then we know that the commit which introduced the changed lines contained a bug.

Author: John

public int sum(int a, int b) {
    return a + b;
}

Author: Bob

    return a * b;
}

    return a + b;
}
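The hypothesis can be sketched as a blame-style walk over the history. This is a minimal sketch and not the tooling used in the talk: the `Commit` class, the commit ids `c1` to `c3`, and the assumption that line numbers never shift between commits are all hypothetical simplifications.

```java
import java.util.*;

class BlameBacktrace {
    enum Label { REGULAR, FIXING, BUGGY }

    /** Hypothetical, simplified commit: the line numbers it rewrote and whether its ticket is a bug fix. */
    static final class Commit {
        final String id;
        final boolean fixesBug;
        final Set<Integer> changedLines;

        Commit(String id, boolean fixesBug, Integer... changedLines) {
            this.id = id;
            this.fixesBug = fixesBug;
            this.changedLines = new HashSet<>(Arrays.asList(changedLines));
        }
    }

    /** Walks the history oldest-first and labels every commit. */
    static Map<String, Label> label(List<Commit> history) {
        Map<Integer, String> lastTouchedBy = new HashMap<>(); // line number -> id of the commit that last wrote it
        Map<String, Label> labels = new LinkedHashMap<>();
        for (Commit c : history) {
            labels.put(c.id, c.fixesBug ? Label.FIXING : Label.REGULAR);
            for (int line : c.changedLines) {
                String previous = lastTouchedBy.put(line, c.id);
                // A fix rewrote this line, so the commit that introduced it is marked buggy.
                if (c.fixesBug && previous != null) {
                    labels.put(previous, Label.BUGGY);
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Commit> history = Arrays.asList(
            new Commit("c1", false, 1, 2, 3), // introduces lines 1-3
            new Commit("c2", false, 4),       // unrelated change
            new Commit("c3", true, 2));       // bug fix that rewrites line 2
        System.out.println(label(history));   // prints {c1=BUGGY, c2=REGULAR, c3=FIXING}
    }
}
```

One consequence of this simple rule: a fixing commit whose lines are later rewritten by another fix is relabelled as buggy, which is one way to read the hypothesis but is a design choice of the sketch.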

Algorithm of metrics collection

● Export all tasks from Jira to an in-memory dictionary

● For each commit, run a backtrace to mark it as buggy, fixing, or regular

● Collect all meaningful data about the commit:

○ month of year, day of week, hour of day, who, how many lines and files, which classes and packages, class complexity and number of notices, how long the task has been in progress

● Put a line with the data into an Attribute-Relation File Format (ARFF) file
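To make the last step concrete, here is a minimal sketch of what one ARFF file for commits could look like, built with plain strings. The relation name, the attribute names, and the reduced feature set are assumptions for illustration; the talk does not show the actual file.

```java
class CommitArffWriter {
    /** ARFF header: one @attribute line per feature, nominal values listed in braces. */
    static String header() {
        return String.join("\n",
            "@relation commits",
            "@attribute month numeric",
            "@attribute dayOfWeek numeric",
            "@attribute hour numeric",
            "@attribute author {john,bob}",
            "@attribute changedLines numeric",
            "@attribute changedFiles numeric",
            "@attribute buggy {true,false}",
            "@data");
    }

    /** One comma-separated data row, in the same order as the attributes above. */
    static String row(int month, int dayOfWeek, int hour, String author,
                      int changedLines, int changedFiles, boolean buggy) {
        return month + "," + dayOfWeek + "," + hour + "," + author + ","
            + changedLines + "," + changedFiles + "," + buggy;
    }

    public static void main(String[] args) {
        System.out.println(header());
        System.out.println(row(3, 5, 17, "bob", 320, 4, true));
        System.out.println(row(3, 1, 10, "john", 12, 1, false));
    }
}
```

The class attribute (`buggy` here) is conventionally placed last, which matches WEKA's default choice of the class attribute when a file is opened in the Explorer.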

Getting educated. WEKA

Waikato Environment for Knowledge Analysis (WEKA) is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.

● Parsers

● Classifiers

● Training/test splits

WEKA challenges

● Convert your data to corresponding vectors

● Choose proper data transformers

● Select and tweak desired Classifiers

● Run experiments and adjust your settings

Good materials about WEKA for beginners:

● How to Run Your First Classifier in Weka

● Data mining with WEKA, Part 2. Classification and clustering

● Document Classification using WEKA

Decision Tree

Ease of results interpretation

Any data can be fed to the method

Can work with scalars and intervals

Decision Tree

[Diagram: decision tree. Question nodes: "Changed more than 50 lines?", "Author is John?", "Author is Bob?", "Changed less than 300 lines?", "Is it Friday?". Leaves: "Has no bugs :)" and "Has a bug :(".]

● John never has bugs!

● Everybody except John and Bob has bugs on Friday.

● Bob has bugs only if he changed more than 300 lines of code.
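Part of the appeal of a decision tree is that its result reads like ordinary branching code. The sketch below encodes only the three rules listed above, not the full tree from the diagram. It assumes that "only if" for Bob means "if and only if", and the author name "Eve" is hypothetical.

```java
import java.time.DayOfWeek;

class CommitRiskRules {
    /** Returns true when the three rules read off the tree predict a bug. */
    static boolean predictsBug(String author, DayOfWeek day, int changedLines) {
        if (author.equals("John")) {
            return false;                // John never has bugs
        }
        if (author.equals("Bob")) {
            return changedLines > 300;   // Bob: only changes above 300 lines are risky
        }
        return day == DayOfWeek.FRIDAY;  // everybody else: bugs on Friday
    }

    public static void main(String[] args) {
        System.out.println(predictsBug("Bob", DayOfWeek.MONDAY, 450)); // prints true
        System.out.println(predictsBug("Eve", DayOfWeek.FRIDAY, 10));  // prints true
        System.out.println(predictsBug("John", DayOfWeek.FRIDAY, 900)); // prints false
    }
}
```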

Decision Tree

The simplest method for building a tree is ID3 (Iterative Dichotomiser 3*).

Build steps:

● Find the attribute whose split leaves the lowest entropy (that is, the largest information gain)

● Split the data set by the found attribute

● Recursively build a tree for each of the subsets

* fates of ID2 and ID1 are lost in history
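The first build step can be sketched in a few lines. This is an illustrative sketch of entropy and information gain for nominal attributes and a boolean class, not WEKA's implementation; the class name `Id3Split` and the four-commit data set are made up for the example.

```java
import java.util.*;

class Id3Split {
    /** Shannon entropy, in bits, of a list of boolean labels (true = buggy). */
    static double entropy(List<Boolean> labels) {
        if (labels.isEmpty()) return 0.0;
        int positives = 0;
        for (boolean b : labels) if (b) positives++;
        double p = (double) positives / labels.size();
        return term(p) + term(1.0 - p);
    }

    private static double term(double p) {
        return p == 0.0 ? 0.0 : -p * Math.log(p) / Math.log(2.0);
    }

    /** Entropy of the whole set minus the size-weighted entropy of the subsets after the split. */
    static double informationGain(List<String[]> rows, List<Boolean> labels, int attribute) {
        Map<String, List<Boolean>> subsets = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            subsets.computeIfAbsent(rows.get(i)[attribute], k -> new ArrayList<>()).add(labels.get(i));
        }
        double weighted = 0.0;
        for (List<Boolean> subset : subsets.values()) {
            weighted += (double) subset.size() / labels.size() * entropy(subset);
        }
        return entropy(labels) - weighted;
    }

    /** Index of the attribute with the largest information gain (first one wins ties). */
    static int bestAttribute(List<String[]> rows, List<Boolean> labels, int attributeCount) {
        int best = 0;
        for (int a = 1; a < attributeCount; a++) {
            if (informationGain(rows, labels, a) > informationGain(rows, labels, best)) best = a;
        }
        return best;
    }

    public static void main(String[] args) {
        // Columns: author, day. Labels: did the commit contain a bug?
        List<String[]> rows = Arrays.asList(
            new String[] {"john", "mon"}, new String[] {"john", "fri"},
            new String[] {"bob", "mon"}, new String[] {"bob", "fri"});
        List<Boolean> labels = Arrays.asList(false, false, true, true);
        System.out.println("best attribute index: " + bestAttribute(rows, labels, 2)); // prints 0 (author)
    }
}
```

In this toy data the author separates buggy from clean commits perfectly, so ID3 would split on it first and then recurse into each subset.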

Naive Bayes classifier

≈80% accuracy*

Simple implementation

Easy to understand

Naive Bayes classifier

30% of all commits with bugs were done by Bob: P(Bob|bug) = 0.3

10% of all commits without bugs were done by Bob: P(Bob|~bug) = 0.1

40% of all commits have bugs: P(bug) = 0.4

60% of all commits have no bugs: P(~bug) = 0.6

What is the probability that the next commit from Bob will have a bug?

P(bug|Bob) = ?
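Worked out with Bayes' theorem, using only the four numbers given on this slide:

```latex
P(\text{bug} \mid \text{Bob})
  = \frac{P(\text{Bob} \mid \text{bug})\, P(\text{bug})}
         {P(\text{Bob} \mid \text{bug})\, P(\text{bug}) + P(\text{Bob} \mid \lnot\text{bug})\, P(\lnot\text{bug})}
  = \frac{0.3 \cdot 0.4}{0.3 \cdot 0.4 + 0.1 \cdot 0.6}
  = \frac{0.12}{0.18}
  \approx 0.67
```

So with these figures, roughly two out of three commits from Bob would be expected to contain a bug.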

Support Vector Machine

Better quality of results

The model is based on relations in data

Sounds fancy :)

Support Vector Machine

Kernel trick


WEKA output results

Output results example (Bayes)

Correctly Classified Instances       14381    77.4755 %
Incorrectly Classified Instances      4181    22.5245 %
Kappa statistic                          0.3085
Mean absolute error                      0.2637
Root mean squared error                  0.3963

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.856    0.544    0.861      0.856   0.858      0.761     false
               0.456    0.144    0.444      0.456   0.45       0.761     true
Weighted Avg.  0.775    0.463    0.777      0.775   0.776      0.761

=== Confusion Matrix ===

     a     b   <-- classified as
 12670  2140 |     a = false
  2041  1711 |     b = true
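The headline numbers can be re-derived from the confusion matrix alone. The sketch below reads the rows as actual classes and the columns as predicted classes, with "true" (buggy) as the positive class; the class and method names are made up for the example.

```java
class ConfusionMatrixMetrics {
    /** Share of all instances that were classified correctly. */
    static double accuracy(long tn, long fp, long fn, long tp) {
        return (double) (tn + tp) / (tn + fp + fn + tp);
    }

    /** Of the commits predicted buggy, the share that really were buggy. */
    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    /** Of the commits that really were buggy, the share that was caught. */
    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        // Values copied from the confusion matrix above.
        long tn = 12670, fp = 2140; // row "a = false"
        long fn = 2041, tp = 1711;  // row "b = true"
        System.out.println("accuracy  = " + accuracy(tn, fp, fn, tp));
        System.out.println("precision = " + precision(tp, fp));
        System.out.println("recall    = " + recall(tp, fn));
    }
}
```

Read this way, the 77% accuracy is carried mostly by the "false" class: fewer than half of the buggy commits are caught (recall of about 0.456), and fewer than half of the commits flagged as buggy really are (precision of about 0.444).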

Integration example

TL;DR: we use GitLab web-hooks and HipChat.

* a long story about SDLC improvements with the help of an IM bot and a set of integrations will be available in a week at the XPDays conference.

Summary

● we found that certain classes are too complex: almost every change to them ends up with a bug

● some engineers shouldn't open some packages at all (or at least we should properly educate them)

● there is still much room for improvement (later commits that overlap and hide earlier ones, other meaningful features, more accurate code history, etc.)

● it does not show you where the error is, but it lets you analyze a commit more carefully

● It was fun! :)

Questions?

Alexey@Tokar.net.ua

VP of Engineering @ WorldAPP
