bug prediction based on your code history

Bug prediction based on your code history Alexey Tokar VP of Engineering @ WorldAPP


Post on 23-Jan-2018


TRANSCRIPT

Page 1: Bug prediction based on your code history

Bug prediction

based on your code history

Alexey Tokar

VP of Engineering @ WorldAPP

Page 2: Bug prediction based on your code history

SDLC: Software Development Life Cycle

● Phase 1: Requirements analysis

● Phase 2: Design

● Phase 3: Development

● Phase 4: Testing

● Phase 5: Maintenance

Page 3: Bug prediction based on your code history

Zoom in to quality control

● static code analyzers find non-conceptual issues

● automated tests cover predefined scenarios

● code reviews are aimed at sharing and enforcing best practices; fewer than 10% of the discussions uncover logical issues

● and, finally, QA has no idea which parts of a system could be affected by a code change, and neither does the programmer

Page 4: Bug prediction based on your code history

20 bugs per week in a production environment

Page 5: Bug prediction based on your code history

Let's try to guess common patterns

● a tired engineer makes more mistakes

● the more an engineer knows about a certain module, the fewer bugs (s)he will produce

● small changes have fewer bugs than long listings

● some parts of the system are more complicated than others, so the risk of introducing a bug there is higher

● huge changes made in a short period of time contain more bugs (done in a hurry)

Page 6: Bug prediction based on your code history

What tools do we have across SDLC?

● ticket types

● action history

● exact code changes

● author of modifications

● class complexity

● code metrics

Page 7: Bug prediction based on your code history

Hypothesis

If we know that a certain commit fixed a bug, then we know that the commit that introduced the changed lines contained a bug.

Commit A (original code):

public int sum( int a, int b ) {
    return a + b;
}

Commit B (Author: Bob), introduces the bug:

public int sum( int a, int b ) {
    return a * b;
}

Commit C (Author: John), fixes the bug:

public int sum( int a, int b ) {
    return a + b;
}
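A minimal sketch of how the hypothesis could be turned into commit labels. It assumes we already know which commits are linked to bug tickets and, for each of them, which earlier commits last touched the lines it changed (for example from `git blame`). The class name, method name, and the in-memory maps are hypothetical and are not taken from the talk.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CommitLabeler {
    public enum Label { REGULAR, FIXING, BUGGY }

    /**
     * @param commits             all commits, oldest first
     * @param fixingCommits       commits linked to a bug ticket
     * @param originsOfFixedLines for each fixing commit, the earlier commits that
     *                            introduced the lines it changed (assumed input)
     */
    public static Map<String, Label> label(List<String> commits,
                                           Set<String> fixingCommits,
                                           Map<String, Set<String>> originsOfFixedLines) {
        Map<String, Label> labels = new LinkedHashMap<>();
        for (String commit : commits) {
            labels.put(commit, fixingCommits.contains(commit) ? Label.FIXING : Label.REGULAR);
        }
        for (String fix : fixingCommits) {
            Set<String> origins = originsOfFixedLines.getOrDefault(fix, Collections.emptySet());
            for (String origin : origins) {
                // Simplification: BUGGY overrides any earlier label,
                // because a fix can itself introduce a new bug.
                if (labels.containsKey(origin)) {
                    labels.put(origin, Label.BUGGY);
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // The slide's example: C fixes lines that were introduced by B.
        Map<String, Label> labels = label(
                List.of("A", "B", "C"),
                Set.of("C"),
                Map.of("C", Set.of("B")));
        System.out.println(labels);
    }
}
```

For the slide's example this marks A as regular, B as buggy, and C as fixing.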

Page 8: Bug prediction based on your code history

Algorithm of metrics collection

● Export all tasks from Jira to an in-memory dictionary

● For each commit, run a backtrace to mark it as buggy, fixing, or regular

● Collect all meaningful data about the commit:

○ Month of year, Day of week, Hour of day, Who, How many lines and files, Which classes and packages, Class complexity and number of notices, How long the task was in progress

● Append a line with the data to an Attribute-Relation File Format (ARFF) file
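To make the last step concrete, here is a small sketch that serializes commit rows into ARFF text. The ARFF keywords (`@relation`, `@attribute`, `@data`) are standard; the relation name, the attribute names, and the `CommitRow` class are hypothetical and cover only a few of the features listed above.

```java
import java.util.List;

public class ArffWriter {
    /** One commit's features: a small, hypothetical subset of the attributes. */
    public static final class CommitRow {
        final String dayOfWeek;
        final int hourOfDay;
        final String author;
        final int linesChanged;
        final int filesChanged;
        final boolean buggy;

        public CommitRow(String dayOfWeek, int hourOfDay, String author,
                         int linesChanged, int filesChanged, boolean buggy) {
            this.dayOfWeek = dayOfWeek;
            this.hourOfDay = hourOfDay;
            this.author = author;
            this.linesChanged = linesChanged;
            this.filesChanged = filesChanged;
            this.buggy = buggy;
        }
    }

    /** Builds the ARFF header followed by one comma-separated data line per commit. */
    public static String toArff(List<String> authors, List<CommitRow> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation commits\n\n");
        sb.append("@attribute dayOfWeek {MON,TUE,WED,THU,FRI,SAT,SUN}\n");
        sb.append("@attribute hourOfDay numeric\n");
        // Nominal attribute: every author seen in the history must be listed.
        sb.append("@attribute author {").append(String.join(",", authors)).append("}\n");
        sb.append("@attribute linesChanged numeric\n");
        sb.append("@attribute filesChanged numeric\n");
        // The class attribute that the classifier will learn to predict.
        sb.append("@attribute buggy {true,false}\n\n");
        sb.append("@data\n");
        for (CommitRow r : rows) {
            sb.append(r.dayOfWeek).append(',')
              .append(r.hourOfDay).append(',')
              .append(r.author).append(',')
              .append(r.linesChanged).append(',')
              .append(r.filesChanged).append(',')
              .append(r.buggy).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toArff(
                List.of("John", "Bob"),
                List.of(new CommitRow("FRI", 17, "Bob", 320, 4, true))));
    }
}
```

Author names containing spaces or commas would need quoting in a real file; the sketch leaves that out.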


Page 9: Bug prediction based on your code history

Getting educated. WEKA

WEKA (Waikato Environment for Knowledge Analysis) is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.

● Parsers

● Classifiers

● Training/test splits


Page 10: Bug prediction based on your code history

WEKA challenges

● Convert your data to corresponding vectors

● Choose proper data transformers

● Select and tweak desired Classifiers

● Run experiments and adjust your settings

Good materials about WEKA for beginners:

● How to Run Your First Classifier in Weka

● Data mining with WEKA, Part 2. Classification and clustering

● Document Classification using WEKA


Page 11: Bug prediction based on your code history

Decision Tree

Ease of results interpretation

Any data can be fed to the method

Can work with scalars and intervals


Page 12: Bug prediction based on your code history

Decision Tree

[Figure: example decision tree. Decision nodes: "Changed less than 300 lines?", "Changed more than 50 lines?", "Author is Bob?", "Author is John?", "Is it Friday?". Leaves: "Has no bugs :)" and "Has a bug :(".]

● John never has bugs!

● Everybody except John and Bob has bugs on Friday.

● Bob has bugs only if he changed more than 300 lines of code.

Page 13: Bug prediction based on your code history

Decision Tree

The simplest method for building a tree is ID3 (Iterative Dichotomiser 3*).

Build steps:

● Find the attribute whose split gives the lowest entropy (that is, the largest information gain)

● Split the data set by the found attribute

● Recursively build a tree for each of the subsets

* The fates of ID2 and ID1 are lost to history.
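The attribute-selection step of ID3 can be sketched as follows. This is an illustration, not the talk's code: it assumes nominal attributes and a boolean "has a bug" label, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InfoGain {
    /** Shannon entropy, in bits, of a list of boolean labels. */
    public static double entropy(List<Boolean> labels) {
        if (labels.isEmpty()) {
            return 0.0;
        }
        long positives = labels.stream().filter(b -> b).count();
        double p = (double) positives / labels.size();
        return term(p) + term(1.0 - p);
    }

    // -p * log2(p), with the usual convention that 0 * log(0) = 0.
    private static double term(double p) {
        return p == 0.0 ? 0.0 : -p * Math.log(p) / Math.log(2);
    }

    /** Information gain obtained by splitting the labels on a nominal attribute. */
    public static double gain(List<String> attributeValues, List<Boolean> labels) {
        Map<String, List<Boolean>> groups = new HashMap<>();
        for (int i = 0; i < attributeValues.size(); i++) {
            groups.computeIfAbsent(attributeValues.get(i), k -> new ArrayList<>())
                  .add(labels.get(i));
        }
        // Weighted average entropy of the subsets after the split.
        double remainder = 0.0;
        for (List<Boolean> group : groups.values()) {
            remainder += (double) group.size() / labels.size() * entropy(group);
        }
        return entropy(labels) - remainder;
    }

    public static void main(String[] args) {
        List<Boolean> hasBug = List.of(false, false, true, true);
        List<String> author = List.of("John", "John", "Bob", "Bob");
        List<String> day = List.of("FRI", "MON", "FRI", "MON");
        System.out.println("gain(author) = " + gain(author, hasBug));
        System.out.println("gain(day)    = " + gain(day, hasBug));
    }
}
```

In this toy data set the author separates buggy from clean commits perfectly (gain of 1 bit) while the day of week tells nothing (gain of 0), so ID3 would split on the author first.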

Page 14: Bug prediction based on your code history

Naive Bayes classifier

≈80% accuracy*

Simple implementation

Easy to understand


Page 15: Bug prediction based on your code history

Naive Bayes classifier


Page 16: Bug prediction based on your code history

Naive Bayes classifier

30% of all commits with bugs were made by Bob: P(Bob|bug) = 0.3

10% of all commits without bugs were made by Bob: P(Bob|~bug) = 0.1

40% of all commits have bugs: P(bug) = 0.4

60% of all commits have no bugs: P(~bug) = 0.6

What is the probability that the next commit from Bob will have a bug?

P(bug|Bob) = ?
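The question can be answered with Bayes' theorem, using only the four numbers given on the slide. The worked calculation below covers this single-feature case; with several features, the "naive" independence assumption simply multiplies one such likelihood per feature.

```latex
\begin{aligned}
P(\text{bug} \mid \text{Bob})
  &= \frac{P(\text{Bob} \mid \text{bug})\,P(\text{bug})}
          {P(\text{Bob} \mid \text{bug})\,P(\text{bug})
           + P(\text{Bob} \mid \neg\text{bug})\,P(\neg\text{bug})} \\
  &= \frac{0.3 \cdot 0.4}{0.3 \cdot 0.4 + 0.1 \cdot 0.6}
   = \frac{0.12}{0.18}
   = \frac{2}{3} \approx 0.67
\end{aligned}
```

So about two out of three of Bob's commits would be expected to contain a bug, compared with the 40% base rate.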

Page 17: Bug prediction based on your code history

Support Vector Machine

Better quality of results

The model is based on relations in data

Sounds fancy :)


Page 18: Bug prediction based on your code history

Support Vector Machine


Page 19: Bug prediction based on your code history

Support Vector Machine

Kernel trick

http://mechanoid.kiev.ua/ml-svm.html

Page 20: Bug prediction based on your code history

WEKA output results

Page 21: Bug prediction based on your code history

Output results example (Bayes)

Correctly Classified Instances       14381    77.4755 %
Incorrectly Classified Instances      4181    22.5245 %
Kappa statistic                          0.3085
Mean absolute error                      0.2637
Root mean squared error                  0.3963

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 0.856    0.544      0.861   0.856      0.858     0.761  false
                 0.456    0.144      0.444   0.456      0.45      0.761  true
Weighted Avg.    0.775    0.463      0.777   0.775      0.776     0.761

=== Confusion Matrix ===

     a     b   <-- classified as
 12670  2140 |     a = false
  2041  1711 |     b = true


Page 22: Bug prediction based on your code history

Output results example (RandomTree)

form < 1
| Registration < 1
| | [email protected] < 1
| | | tpl < 1
| | | | filters < 1
| | | | | [email protected] < 1
| | | | | | [email protected] < 1 : false
| | | | | | [email protected] >= 1 : true
| | | | | [email protected] >= 1
| | | | | | ObjectDesign < 1 : true
| | | | | | ObjectDesign >= 1 : false
| | | | filters >= 1 : false
| | | tpl >= 1 : true
| | [email protected] >= 1
| | | bundle < 1
| | | | xmail < 1
| | | | | general < 1
| | | | | | dataimport < 1
| | | | | | | oracle < 1 : false
| | | | | | | oracle >= 1 : true
| | | | | | dataimport >= 1 : false
| | | | | general >= 1
| | | | | | filesedited < 2 : false
| | | | | | filesedited >= 2 : true
| | | | xmail >= 1 : false
| | | bundle >= 1 : true
| Registration >= 1 : true

Page 23: Bug prediction based on your code history

Integration example

TL;DR: we use GitLab webhooks and HipChat.

* The long story about SDLC improvements with the help of an IM bot and a set of integrations will be told in a week at the XPDays conference.

Page 24: Bug prediction based on your code history

Summary

● we found that certain classes are too complex, as almost every change to them ends up with a bug

● some engineers shouldn't open some packages at all (or at least we should properly educate them)

● there is still a lot of room for improvement (overlapping commits that hide one another, other meaningful features, more accurate code history, etc.)

● it does not show you where the error is, but it lets you analyze a risky commit more carefully

● It was fun! :)

Page 25: Bug prediction based on your code history

[email protected]

VP of Engineering @ WorldAPP
