Evaluating Coding Standards
Relating Violations and Observed Faults

Cathal Boogerd, Software Evolution Research Lab (SWERL)
Coding Standards
Put the name of a well-known code inspection tool on a poster in the middle of a software development department…
Lots of discussion!
Developer: “I have to do this, but I don’t have time”
Architect: “Quality assessed by stupid rules”
QA Manager: “Difficult to get people to use the tool”
A Notorious Subject
Rules gained after long years of experience with ‘faulty’ constructs
Using intricate knowledge of a language
Rules usually rather straightforward
Making automatic detection feasible
Many tools exist with pre-defined rulesets and support for customization
QA-C, Codesonar, Findbugs, and many more
Clearly, this is a simple and sensible preventive approach
Or is it?
Pros: why bother?
Automatic code inspection tools often produce many false positives
Situations where it is difficult to see the potential link to a fault
Cases where developers know the construct is harmless
Solutions to reported violations can take the form of ‘tool satisfaction’
Developers find a workaround to silence the tool, rather than think about what is actually going on
Any modification has a non-zero probability of introducing a fault
No empirical evidence supporting the intuition that rules prevent faults!
Cons: please get those tools away from me!
All statements are potentially faulty, but…
Lines with violations are more likely to be faulty than lines without
Releases with more violations contain more (latent) faults
Intuitive for two releases of the same software
But we have to account for size, so we use densities instead
Modules within one release with a higher violation density (vd) have a higher fault density (fd)
This would point out potential problem areas in the software
How to gather empirical evidence for these ideas?
Just put a question mark behind them…
Implicit basic idea and its consequences
Temporal aspect: Do rule violations explain occurrences of faults across releases?
On a project level: rank correlations of releases
Spatial aspect: Do rule violations explain locations of faults within releases?
Different levels of granularity: rank correlations at file and module level
Combined: Do rule violations explain locations of faults across releases?
On a line level: true positive rates for violations
We investigate this for the body of violations as a whole, as well as for individual rules (see the correlation sketch below)
How to do this? Measurement approach
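To make this concrete, here is a minimal Python sketch of the cross-release correlation, assuming the per-release counts of violations, faults, and size are already available; the function names and the numbers in the example are illustrative, not data from the study.

# Sketch of the cross-release (temporal) analysis: rank-correlate violation
# density with fault density over a series of releases.
from scipy.stats import spearmanr

def density(count, loc):
    """Count per 1000 physical lines of code."""
    return 1000.0 * count / loc

def cross_release_correlation(violations, faults, sizes):
    vd = [density(v, s) for v, s in zip(violations, sizes)]
    fd = [density(f, s) for f, s in zip(faults, sizes)]
    return spearmanr(vd, fd)   # (rank correlation, p-value)

# Illustrative numbers only, not data from TVoM or Vproc:
rho, p = cross_release_correlation(
    violations=[410, 430, 455, 470, 500],
    faults=[12, 14, 15, 18, 21],
    sizes=[88000, 89000, 90000, 90500, 91000])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

Densities rather than raw counts are used to account for size, as argued above; Spearman's coefficient is one way to realise the rank correlations the slides refer to.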
TVoM: DRiver Abstraction Layer (DRAL), approx. 90 KLoC of C
214 daily build releases, ~460 PRs
Vproc: video processing part of the TV software, developed in Eindhoven
Approx. 650 KLoC of C
41 internal releases, ~310 PRs
SCM: Telelogic Synergy
Both embedded software projects within NXP, but:
Vproc is larger and more mature (product line vs. new project)
Projects: TVoM and Vproc
Case Study
Coding standard based on the notion of a safer language subset
Banning potentially unsafe constructs
MISRA-C 1998, by MIRA, a UK-based consortium from the automotive industry
Widely adopted in industry, also outside automotive
In 2004 the current, revised version was released
Coding Standard: MISRA-C: 2004
Measure per release: the number of violations, the number of faults, and size (LoC); see the sketch below
Violations (also per rule): counted by automatic code inspection
Faults:
Estimate by taking number of open issues at release date
This is a conservative approximation!
Size:
Measure the number of physical lines of code
We opt for physical, since rules need not be limited to statements
Note that this does not require the issue database
Temporal Aspect
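A minimal sketch of how the three per-release measures could be gathered; the inspection report format and the issue record layout are assumptions, only the counting logic follows the description above.

# Sketch of gathering the three per-release measures.
def count_violations(report_lines):
    """One violation per non-empty report line, tallied per rule id; the
    rule id is assumed to be the last whitespace-separated field (e.g. '14.2')."""
    per_rule = {}
    for line in report_lines:
        if not line.strip():
            continue
        rule = line.split()[-1]
        per_rule[rule] = per_rule.get(rule, 0) + 1
    return sum(per_rule.values()), per_rule

def count_open_faults(issues, release_date):
    """Conservative fault estimate: problem reports opened on or before the
    release date and not yet closed at that date (opened/closed are comparable
    dates, closed may be None)."""
    return sum(1 for opened, closed in issues
               if opened <= release_date and (closed is None or closed > release_date))

def physical_loc(paths):
    """Size as physical lines of code: every line counts, since rules need
    not be limited to statements."""
    total = 0
    for path in paths:
        with open(path, errors="replace") as f:
            total += sum(1 for _ in f)
    return total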
Measure violation density as before
Estimate the number of faults by tracking faulty lines
Extract all used files from the selected releases
Retrieve all versions of those files from the repository
Create a file-version graph with a diff for each edge
Use the file-version graph and diffs to track faulty lines to their origin (sketched below)
A fault is assumed to be present from the first occurrence of one of its constituent lines until the conclusion of the issue
Spatial Aspect
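The line-tracking step can be sketched with plain textual diffs; difflib stands in here for the per-edge diffs of the file-version graph, and the data layout (a list of file versions, oldest first) is an assumption made for illustration.

# Sketch of the line-tracking step: map a line in the latest version of a
# file back through earlier versions to find where it first appeared.
import difflib

def line_map(old_lines, new_lines):
    """For each line number in new_lines, the matching line number in
    old_lines, or absent if the line was added in the new version."""
    mapping = {}
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            for off in range(j2 - j1):
                mapping[j1 + off] = i1 + off
    return mapping

def origin_of_line(versions, line_no):
    """Walk backwards over the file versions (oldest first) and return the
    index of the version in which the given line first appeared."""
    origin = len(versions) - 1
    current = line_no
    for idx in range(len(versions) - 1, 0, -1):
        mapping = line_map(versions[idx - 1], versions[idx])
        if current not in mapping:
            break                      # line was introduced in versions[idx]
        current = mapping[current]
        origin = idx - 1
    return origin

In the study the graph edges come from the SCM (Telelogic Synergy); the sketch only shows how a single line is followed back to its origin version.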
Measurement Approach
Matches violations and faults on a line basis, by tracking violations through the version history just as faults are tracked in the spatial approach (see the sketch below)
The true positive rate is #true positives / #violations
A ‘true positive’ is a violation that correctly predicted the line containing it to be faulty, i.e. part of a bug fix
The number of violations is the number of unique violations over the whole history
Defined by the violation id and the specific line containing it
How to assess the significance of the true positive rate?
Temporal-Spatial Aspect
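A minimal sketch of the matching itself, assuming violations and fault-fix lines have already been traced back to (file, origin line) pairs as in the spatial approach; the data structures are illustrative.

# Sketch of the temporal-spatial matching: a violation is identified by its
# rule id and the origin of the line it flags; it counts as a true positive
# if that (file, origin line) was ever part of a fault fix.
def true_positive_rate(violations, faulty_lines):
    """violations: set of (rule_id, file, origin_line) over the whole history
       faulty_lines: set of (file, origin_line) touched by fault fixes"""
    if not violations:
        return 0.0
    hits = sum(1 for (_rule, f, line) in violations if (f, line) in faulty_lines)
    return hits / len(violations)

def per_rule_rates(violations, faulty_lines):
    rates = {}
    for rule in {r for (r, _f, _l) in violations}:
        subset = {v for v in violations if v[0] == rule}
        rates[rule] = true_positive_rate(subset, faulty_lines)
    return rates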
Suppose a certain rule marks every line as a violation…
In this case the true positive rate will be equal to the faulty line ratio
In general: a random line predictor will end up around that ratio
Given a sufficient number of attempts
We need to determine whether violations outperform a uniform random line predictor
Random predictor can be modeled as a Bernoulli process
p = faulty line ratio, #attempts = #violations, #successes = #TPs
The number of successes is binomially distributed; use its CDF to determine the significance of the observed #TPs (sketched below)
Significance of line-based prediction
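This amounts to a one-sided binomial test against the random line predictor; the sketch below uses scipy, and the numbers in the usage lines are illustrative only, not results from TVoM or Vproc.

# Sketch of the significance test: model a uniform random line predictor as
# a Bernoulli process with success probability equal to the faulty line
# ratio, and ask how likely it is to reach at least the observed number of
# true positives in #violations attempts.
from scipy.stats import binom

def significance(true_positives, n_violations, faulty_line_ratio):
    """P(random predictor scores >= true_positives hits)."""
    return binom.sf(true_positives - 1, n_violations, faulty_line_ratio)

# Illustrative numbers: 260 violations, faulty line ratio 0.17, 55 hits.
p = significance(55, 260, 0.17)
print(f"p-value = {p:.4f}")   # small p => violations beat the random predictor

binom.sf(k-1, n, p) gives P(X ≥ k) for X ~ Binomial(n, p), i.e. the probability that the random predictor does at least as well as the observed violations.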
Results for TVoM
No relation in the first part of the project, but there is one in the second part
Rank correlation: 0.76, R² = 0.57, significant
Individual rules:
Cross-release correlation
Out of 72 rules, 13 had a TP rate > faulty line rate (0.17)
Of which 11 significant with α = 0.05
Although better than random, this does not say anything about applicability
For instance, rule 14.2 has 260 violations, of which 70% are false positives
On average, this requires about 10 tries before one is successful
To be relatively sure (α = 0.05) requires selection of 26 violations
However, work load issues can be addressed by process design
Automatic run of code inspection upon check-in
Developer would only inspect violations of his delta (more context)
In that case, true positive rates can be useful for prioritization (a worked example follows below)
True positive rates
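The workload argument can be made explicit with a short calculation. The per-rule true positive rate is not stated directly above, so r is an assumed value (roughly 0.11), chosen only because it reproduces the "~10 tries" and "26 violations at α = 0.05" figures quoted for rule 14.2.

# Worked version of the workload argument, for an assumed true positive rate r.
import math

def expected_tries(r):
    """Expected number of violations to inspect before hitting a faulty line."""
    return 1.0 / r

def tries_for_confidence(r, alpha=0.05):
    """Smallest n such that the chance of no hit in n tries, (1 - r)**n, is <= alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - r))

r = 0.11                         # assumed per-rule true positive rate (illustrative)
print(expected_tries(r))         # -> about 9, i.e. roughly the '10 tries' above
print(tries_for_confidence(r))   # -> 26, matching the selection of 26 violations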
Results for Vproc
Cross-release correlation
Individual rules
Out of 78 rules, 55 had a TP rate > faulty line rate (0.0005)
Of which 29 significant with α = 0.05
Faulty line rate is very different from TVoM!
Mature code, many files never modified
Does the assumption of uniform distribution still hold?
Analyzed additions in isolation (i.e., modified files only)
Faulty line rate becomes 0.06
Now only 40 rules have a TP rate > faulty line rate, 14 significant
NB: some rules have very few violations
These easily outperform the random predictor (but not significantly)
True positive rates
Found some evidence for a relation between violations and faults
In both cases, but especially in TVoM
At this point, no pattern of rules stands out
However, no consistent behavior of rules for the two cases
More cases are needed to increase confidence in results
A priori rule selection currently not possible
Temporal method easier to apply, but some problems:
Inaccurate estimate of number of faults
Too sensitive to changes other than fault-fixes
Lessons learned
Write a C++ style comment for every fault fix
Other (non-fix) modifications might obscure the correlation
Spatial methods may be too restrictive
Not all modified/deleted lines in a fault-fix are faulty
Sometimes fault-fixes only introduce new code; unable to locate fault
Must take care in selection of codebase to analyze
Preliminary in-release results indicate no correlation
Lessons learned (continued)
Conclusions
Note that there may be more reasons than fault prevention to adhere to a coding standard
Maintainability: readability, common style
Portability: minimize issues due to compiler changes
Nevertheless, quantification of fault prevention can be an important asset in the cost-benefit analysis of adherence
You may have noticed not all results were in the slides: work in progress!
Final remarks