Evaluating Coding Standards


Post on 19-Jan-2016






Relating Violations and Observed Faults


Cathal Boogerd, Software Evolution Research Lab (SWERL)
Coding Standards
Put the name of a well-known code inspection tool on a poster in the middle of a software development department…
Lots of discussion!
Developer: “I have to do this, but I don’t have time”
Architect: “Quality assessed by stupid rules”
QA Manager: “Difficult to get people to use the tool”
A Notorious Subject
Gained after long years of experience with ‘faulty’ constructs
Using intricate knowledge of a language
Rules usually rather straightforward
Making automatic detection feasible
Many tools exist with pre-defined rulesets and support for customization
QA-C, Codesonar, Findbugs, and many more
Clearly, this is a simple and sensible preventive approach
Or is it?
Pros: why bother?
Automatic code inspection tools often produce many false positives
Situations where it is difficult to see the potential link to a fault
Cases where developers know the construct is harmless
Solutions to reported violations can take the form of ‘tool satisfaction’
Developers find a workaround to silence the tool, rather than think about what is actually going on
Any modification has a non-zero probability of introducing a fault
No empirical evidence supporting the intuition that rules prevent faults!
Cons: please get those tools away from me!
All statements are potentially faulty, but…
Lines with violations are more likely to be faulty than lines without
Releases with more violations contain more (latent) faults
Intuitive for two releases of the same software
But: have to account for size, use densities instead
Modules within one release with a higher violation density (vd) have a higher fault density (fd)
This would point out potential problem areas in the software
How to gather empirical evidence for these ideas?
Just put a question mark behind them…
Implicit basic idea and its consequences
Temporal aspect: Do rule violations explain occurrences of faults across releases?
On a project level: rank correlations of releases
Spatial aspect: Do rule violations explain locations of faults within releases?
Different levels of granularity: rank correlations at file and module level
Combined: Do rule violations explain locations of faults across releases?
On a line level: true positive rates for violations
We investigate this for the body of violations as a whole, as well as for individual rules
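The rank correlations mentioned above can be sketched as follows. This is a minimal, standard-library Spearman rank correlation, not the tooling used in the study; the function names and any input data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks for values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(violation_densities, fault_densities):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(violation_densities),
                   average_ranks(fault_densities))
```

For the temporal aspect the two inputs would hold one density pair per release; for the spatial aspect, one pair per file or module within a release.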
How to do this? Measurement approach
DRiver Abstraction Layer (DRAL): approx. 90KLoC in C
214 daily build releases, ~460 problem reports (PRs)
Vproc: video processing part of TV software
Vproc (developed in Eindhoven): approx. 650KLoC in C
41 internal releases, ~310 PRs
SCM: Telelogic Synergy
Both embedded software projects within NXP, but:
Vproc larger and more mature (productline vs new project)
Projects: TVoM and Vproc
Case Study
Coding standard based on the notion of a safer language subset
Banning potentially unsafe constructs
MISRA-C:1998, published by MIRA on behalf of MISRA, a UK-based consortium of automotive industries
Widely adopted in industry, also outside automotive
In 2004 the current, revised version was released
Coding Standard: MISRA-C:2004
Number of violations, number of faults, and size (LoC)
Violations (also per rule):
Faults: estimate by taking the number of open issues at the release date
This is a conservative approximation!
Measure the number of physical lines of code
We opt for physical, since rules need not be limited to statements
Note that this does not require the issue database
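A minimal sketch of turning the three measures above into densities per release. Normalizing per 1000 physical lines (KLoC) is an assumption made here for readability; the slides only say that densities are used. All names are hypothetical.

```python
def count_physical_lines(source_text):
    """Count physical lines, including blank and comment-only lines."""
    return len(source_text.splitlines())


def density_per_kloc(count, physical_lines):
    """Number of violations or faults per 1000 physical lines of code."""
    if physical_lines <= 0:
        raise ValueError("physical_lines must be positive")
    return 1000.0 * count / physical_lines


# Hypothetical release: 450 violations and 9 open issues in 90,000 lines.
violation_density = density_per_kloc(450, 90_000)
fault_density = density_per_kloc(9, 90_000)
```

Counting physical rather than logical lines keeps the size measure consistent with rules that can fire on comments or preprocessor lines, not only on statements.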
Temporal Aspect
Measure violation density as before
Estimating the number of faults by tracking faulty lines
Extract all used files from the selected releases
Retrieve all versions of those files from the repository
Create a file-version graph with a diff for each edge
Use file-version graph and diffs to track faulty lines to their origin
Fault is assumed to be present from first occurrence of one of its constituting lines until conclusion of the issue
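The line-tracking step can be sketched with Python's difflib. This is a simplified illustration, not the study's implementation: it assumes a linear history (a list of versions, oldest first) instead of a full file-version graph, and it treats a line as "the same" only when its text is unchanged between versions. Function names are hypothetical.

```python
import difflib


def map_lines_back(old_lines, new_lines):
    """Map indices of unchanged lines in new_lines to their index in old_lines."""
    mapping = {}
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.b + offset] = block.a + offset
    return mapping


def origin_of_line(versions, line_index):
    """Trace a line of the newest version back to its first occurrence.

    versions: list of files, each a list of lines, oldest first.
    Returns (version_index, line_index) of the earliest matching line.
    """
    version_index = len(versions) - 1
    while version_index > 0:
        back = map_lines_back(versions[version_index - 1],
                              versions[version_index])
        if line_index not in back:
            break  # the line was added or changed in this version
        line_index = back[line_index]
        version_index -= 1
    return version_index, line_index


# Hypothetical history; the last version is the one just before the fix.
v0 = ["int x;", "x = 1;"]
v1 = ["int x;", "int y;", "x = 1;", "y = x / 0;"]
v2 = ["int x;", "int y;", "x = 1;", "y = x / 0;", "return y;"]
fault_origin = origin_of_line([v0, v1, v2], 3)  # line "y = x / 0;"
```

The faulty lines are the ones a fix modifies or deletes, so tracing starts from the version just before the fix; the fault is then counted in every selected release between the origin and the conclusion of the issue.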
Spatial Aspect
Measurement Approach
Match violations and faults on a line basis, tracking violations in the same way as faults in the spatial approach
The true positive rate is: # true positives / # violations
A ‘true positive’ is a violation that correctly predicted the line containing it to be faulty, i.e. part of a bug fix
The number of violations is the unique number over the whole history
Defined by the violation id and the specific line containing it
How to assess the significance of the true positive rate?
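Before turning to significance, the true positive rate as defined above can be sketched as follows. The data layout is an assumption: each violation is a (rule id, line origin) pair, where the line origin is any hashable identifier of the tracked line, and faulty lines are given as a set of such origins.

```python
def true_positive_rate(violations, faulty_line_origins):
    """Fraction of unique violations that sit on a line involved in a bug fix.

    violations: iterable of (rule_id, line_origin) pairs; duplicates of the
        same pair across releases are counted once.
    faulty_line_origins: set of line origins that were part of a bug fix.
    """
    unique_violations = set(violations)
    if not unique_violations:
        raise ValueError("no violations to evaluate")
    true_positives = {v for v in unique_violations
                      if v[1] in faulty_line_origins}
    return len(true_positives) / len(unique_violations)


# Hypothetical data: the first violation is seen in two releases.
violations = [
    ("14.2", ("a.c", 1, 3)),
    ("14.2", ("a.c", 1, 3)),
    ("10.1", ("a.c", 0, 1)),
    ("14.2", ("b.c", 0, 7)),
    ("10.1", ("b.c", 2, 9)),
]
faulty = {("a.c", 1, 3)}
rate = true_positive_rate(violations, faulty)
```

Restricting `violations` to a single rule id gives the per-rule rate.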
Temporal-Spatial Aspect
Suppose a certain rule marks every line as a violation…
In this case the true positive rate will be equal to the faulty line ratio
In general: a random line predictor will end up around that ratio
Given a sufficient number of attempts
We need to determine whether violations outperform a uniform random line predictor
Random predictor can be modeled as a Bernoulli process
p = faulty line ratio, #attempts = #violations, #successes = #TPs
Distribution is binomial, use CDF to determine significance of #TPs
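A minimal sketch of the binomial test described above, using only the standard library. The probability mass is computed in log space so that large violation counts do not overflow; the function names and the example numbers are hypothetical, and the one-sided form (probability of at least the observed number of true positives) is an interpretation of "use CDF to determine significance".

```python
from math import exp, lgamma, log


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    total = 0.0
    for i in range(0, k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def p_value_at_least(true_positives, violations, faulty_line_ratio):
    """Chance that a uniform random line predictor scores this well or better.

    violations attempts, each hitting a faulty line with probability
    faulty_line_ratio; returns P(X >= true_positives).
    """
    if true_positives == 0:
        return 1.0
    return 1.0 - binomial_cdf(true_positives - 1, violations,
                              faulty_line_ratio)


# Hypothetical rule: 100 violations, 30 true positives, faulty line ratio 0.17.
p_value = p_value_at_least(30, 100, 0.17)
is_significant = p_value < 0.05
```

A rule outperforms the random predictor at level α when this p-value falls below α.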
Significance of line-based prediction
Results for TVoM
No relation in the first part of the project, but there is one in the second part
Rank correlation: 0.76, R² = 0.57, significant
Individual rules:
Cross-release correlation
Results for TVoM
Out of 72 rules, 13 had a true positive (TP) rate > faulty line rate (0.17)
Of which 11 significant with α = 0.05
Although better than random, this does not say anything about applicability
For instance, rule 14.2 has 260 violations, of which 70% are false positives
On average, this requires about 10 tries before one is successful
To be relatively sure (α = 0.05) requires selection of 26 violations
However, work load issues can be addressed by process design
Automatic run of code inspection upon check-in
Developer would only inspect violations of his delta (more context)
In that case, true positive rates can be useful for prioritization
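The inspection-effort reasoning above (expected tries, and the number of violations to select to be relatively sure of a hit) can be sketched with a geometric model. This assumes each inspected violation is an independent trial with the same success probability, which the slides do not state; the rate used in the example is purely illustrative and is not taken from the study.

```python
from math import ceil, log


def expected_tries(tp_rate):
    """Expected number of inspected violations until the first true positive."""
    if not 0.0 < tp_rate <= 1.0:
        raise ValueError("tp_rate must be in (0, 1]")
    return 1.0 / tp_rate


def selections_needed(tp_rate, alpha=0.05):
    """Smallest n such that P(no true positive in n tries) <= alpha."""
    if not 0.0 < tp_rate < 1.0:
        raise ValueError("tp_rate must be in (0, 1)")
    return ceil(log(alpha) / log(1.0 - tp_rate))


# Illustrative rate only: half of the violations are true positives.
tries = expected_tries(0.5)        # 2.0
needed = selections_needed(0.5)    # 5, since 0.5 ** 5 <= 0.05 < 0.5 ** 4
```

Lower true positive rates drive both numbers up quickly, which is why prioritizing rules by true positive rate matters for the developer's inspection workload.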
True positive rates
Individual rules
Cross-release correlation
Results for Vproc
Out of 78 rules, 55 had a true positive (TP) rate > faulty line rate (0.0005)
Of which 29 significant with α = 0.05
Faulty line rate is very different from TVoM!
Mature code, many files never modified
Does the assumption of uniform distribution still hold?
Analyzed addition in isolation (i.e., modified files only)
Faulty line rate becomes 0.06
Now only 40 rules have a TP rate > faulty line rate, 14 significant
NB: some rules have very few violations
Easily outperforms random predictor (but not significant)
True positive rates
Found some evidence for a relation between violations and faults
In both cases, but especially in TVoM
At this point, no pattern of rules stands out
However, no consistent behavior of rules for the two cases
More cases are needed to increase confidence in results
A priori rule selection currently not possible
Temporal method easier to apply, but some problems:
Inaccurate estimate of number of faults
Too sensitive to changes other than fault-fixes
Lessons learned
Write a C++ style comment for every fault fix
Other (non-fix) modifications might obscure the correlation
Spatial methods may be too restrictive
Not all modified/deleted lines in a fault-fix are faulty
Sometimes fault-fixes only introduce new code; unable to locate fault
Must take care in selection of codebase to analyze
Preliminary in-release results indicate no correlation
Lessons learned
Note that there may be more reasons than fault prevention to adhere to a coding standard
Maintainability: readability, common style
Portability: minimize issues due to compiler changes
Nevertheless, quantification of fault prevention can be an important asset in the cost-benefit analysis of adherence
You may have noticed that not all results were in the slides: work in progress!
Final remarks