Benchmarking Effectiveness
for Object-Oriented Unit Testing
Anthony J H Simons and Christopher D Thomson
Overview
Measuring testing?
The Behavioural Response
Measuring six test cases
Evaluation of JUnit tests
Evaluation of JWalk tests
http://www.dcs.shef.ac.uk/~ajhs/jwalk/
Analogy: Metrics and Testing
Things easy to measure (but why?)
– metrics: MIT O-O metrics (Chidamber & Kemerer)
– testing: decision-, path-, whatever-coverage
– testing: count exceptions, reduce test-set size
Properties you really want (but how?)
– metrics: Goal, Question, Metric (Basili et al.)
– testing: e.g. mutant killing index
– testing: effectiveness and efficiency?
Measuring Testing?
Most approaches measure testing effort, rather than test effectiveness!
Degrees of Correctness
Suppose an ideal test set
– BR : behavioural response (set)
– T : tests to be evaluated (bag – duplicates?)
– TE = BR ∩ T : effective tests (set)
– TR = T – TE : redundant tests (bag)
Define test metrics
– Ef(T) = (|TE| – |TR|) / |BR| : effectiveness
– Ad(T) = |TE| / |BR| : adequacy
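Both metrics are a few lines of code. A minimal sketch, using hypothetical response labels and the Stack1 figures that appear later in the talk (T = 20, TE = 12, TR = 8, BR = 12):

```python
def adequacy(br, tests):
    """Ad(T) = |TE| / |BR| : fraction of the ideal responses covered."""
    te = br & set(tests)           # TE = BR ∩ T : effective tests (set)
    return len(te) / len(br)

def effectiveness(br, tests):
    """Ef(T) = (|TE| - |TR|) / |BR| : coverage penalised by redundancy."""
    te = br & set(tests)
    tr = len(tests) - len(te)      # |TR| = |T| - |TE| : redundant tests (bag)
    return (len(te) - tr) / len(br)

# hypothetical labels: 12 distinct responses, 20 tests, 8 of them duplicates
br = {f"r{i}" for i in range(12)}
tests = [f"r{i}" for i in range(12)] + ["r0"] * 8
print(adequacy(br, tests))         # 1.0
print(effectiveness(br, tests))    # (12 - 8) / 12 = 0.333...
```

Note that Ef(T) goes negative when redundant tests outnumber effective ones, which is exactly what the JUnit results later show for the Book classes.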
Ideal Test Set?
The ideal test set must verify each distinct response of an object!
What is a Response?
Input response
– Account.withdraw(int amount) : 3 partitions
  • amount < 0 : fail precondition, exception
  • amount > balance : refuse, no change
  • amount <= balance : succeed, debit
State response
– Stack.pop() : 2 states
  • isEmpty() : fail precondition, exception
  • !isEmpty() : succeed
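The two examples can be made concrete. A minimal Python sketch, where the method bodies are assumptions chosen only to realise the listed partitions and states:

```python
class Account:
    def __init__(self, balance=0):
        self.balance = balance

    def withdraw(self, amount):
        if amount < 0:                 # partition 1: fail precondition
            raise ValueError("negative amount")
        if amount > self.balance:      # partition 2: refuse, no change
            return False
        self.balance -= amount         # partition 3: succeed, debit
        return True

class Stack:
    def __init__(self):
        self._items = []

    def is_empty(self):
        return not self._items

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if self.is_empty():            # state 1: fail precondition
            raise IndexError("pop from empty stack")
        return self._items.pop()       # state 2: succeed
```

An ideal test set must exercise all three withdraw partitions and observe pop in both states.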
Behavioural Response – 1
Input response
– c.f. exemplars of equivalence partitions
– max responses per method, over all states
State response
– c.f. state cover, to reach all states
– max state-contingent responses, over all methods
Behavioural Response
– product of input and state response
– checks all argument partitions in all states
– c.f. transition cover augmented by exemplars
Behavioural Response – 2
Parametric form: BR(x, y)
– stronger ideal sets, for higher x, y
– x = length of sequences from each state
– y = number of exemplars for each partition
Redundant states
– higher x rules out faults hiding in duplicated states
Boundary values
– higher y verifies equivalence partition boundaries
Useful measure
– precise quantification of what has been tested
– repeatable guarantees of quality after testing
Compare Testing Methods
JWalk – “Lazy systematic unit testing method”
JUnit – “Expert manual unit testing method”
JUnit – Beck, Gamma
“Automates testing”
– manual test authoring (as good as human expertise)
– may focus on positive, miss negative test cases
– saved tests automatically re-executed on demand
– regression style may mask hard interleaved cases
Test harness
– bias: test method “testX” for each method “X”
– each “testX” contains n assertions = n test cases
– same assertions appear redundantly in “testY”, “testZ”
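The “testX per X” harness bias looks like this in a unittest-style analog (the Account class and assertions here are invented for illustration; the original point is about JUnit in Java):

```python
import unittest

class Account:
    """Hypothetical class under test."""
    def __init__(self):
        self.balance = 0
    def deposit(self, n):
        self.balance += n
    def withdraw(self, n):
        self.balance -= n

class AccountTest(unittest.TestCase):
    def test_deposit(self):              # one "testX" per method "X"
        a = Account()
        a.deposit(10)
        self.assertEqual(a.balance, 10)

    def test_withdraw(self):
        a = Account()
        a.deposit(10)
        self.assertEqual(a.balance, 10)  # same assertion repeated from
        a.withdraw(4)                    # test_deposit: a redundant test
        self.assertEqual(a.balance, 6)
```

Counted as test cases, the repeated deposit assertion inflates T without adding anything to TE.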
JWalk – Simons
Lazy specification
– static analysis of compiled code
– dynamic analysis of state model
– adapts to change, revises the state model
Systematic testing
– bounded exhaustive state-based exploration
– may not generate exemplars for all input partitions
– semi-automatic oracle construction (confirm key values)
– learns test equivalence classes (predictive testing)
– adapts existing oracles, superclass oracles
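The idea behind bounded exhaustive, state-based exploration can be sketched as follows. This illustrates the principle only, not JWalk's actual algorithm:

```python
from itertools import product

def explore(new, methods, depth):
    """Run every method sequence up to length `depth` on a fresh object,
    recording the final observable outcome (return value or exception)."""
    outcomes = {}
    for n in range(1, depth + 1):
        for seq in product(methods, repeat=n):
            obj = new()
            try:
                result = None
                for m in seq:
                    result = m(obj)
            except Exception as e:
                result = type(e).__name__
            outcomes[seq] = result
    return outcomes

# explore a plain Python list used as a stack
def push(s):
    s.append("x")
    return len(s)

def pop(s):
    return s.pop()

for seq, out in explore(list, [push, pop], 2).items():
    print([m.__name__ for m in seq], "->", out)
```

Each sequence's outcome is what the tester is asked to confirm or reject, which is where the semi-automatic oracle construction comes in.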
Six Test Cases
Stack1 – simple linked stack
Stack2 – bounded array stack
– change of implementation
Book1 – simple loanable book
Book2 – also with reservations
– extension by inheritance
Account1 – with deposit/withdraw
Account2 – with preconditions
– refinement of specification
Instructions to Testers
Test each response for each class, similar to the transition cover, but with all equivalence partitions for method inputs.
Behavioural Response
Test Class   API   Input R   State R   BR(1,1)
Stack1         6         6         2        12
Stack2         7         7         3        21
Book1          5         5         2        10
Book2          9        10         4        40
Account1       5         6         2        12
Account2       5         9         2        18
(ideal test target)
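Since BR(1,1) is the product of the input and state responses, the target column can be checked mechanically:

```python
# (input response, state response) per test class, from the table above
responses = {
    "Stack1":   (6, 2), "Stack2":   (7, 3),
    "Book1":    (5, 2), "Book2":   (10, 4),
    "Account1": (6, 2), "Account2": (9, 2),
}
br = {name: i * s for name, (i, s) in responses.items()}
print(br)   # Stack1: 12, Stack2: 21, Book1: 10, Book2: 40, ...
```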
JUnit – Expert Testing
Test Class     T    TE    TR   Ad(T)   Ef(T)    time
Stack1        20    12     8    1.00    0.33   11.31
Stack2        23    16     7    0.76    0.43  +14.00
Book1         31     9    22    0.90   -1.30   11.00
Book2        104    21    83    0.53   -1.55  +20.00
Account1      24    12    12    1.00    0.00   14.37
Account2      22    17     5    0.94    0.67    8.44
(massive generation, still not effective)
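The Ad(T) and Ef(T) columns follow directly from the earlier definitions; for example, the Book1 row (T = 31, TE = 9, TR = 22, with BR = 10 from the target table):

```python
te, tr, br = 9, 22, 10       # Book1 row; BR(1,1) from the target table
ad = te / br                 # adequacy: 9 / 10 = 0.90
ef = (te - tr) / br          # effectiveness: (9 - 22) / 10 = -1.30
print(ad, ef)                # 0.9 -1.3
```

The negative effectiveness records that 22 redundant tests swamp the 9 effective ones.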
JWalk – Test Generation
Test Class     T    TE    TR   Ad(T)   Ef(T)    time
Stack1        12    12     0    1.00    1.00    0.42
Stack2        21    21     0    1.00    1.00    0.50
Book1         10    10     0    1.00    1.00    0.30
Book2         36    36     0    0.90    0.90    0.46
Account1      12    12     0    1.00    1.00    1.17
Account2      17    17     0    0.94    0.94   16.10
(no wasted tests; missed 5 inputs)
Comparisons
JUnit: expert manual testing
– massive over-generation of tests (w.r.t. goal)
– sometimes adequate, but not effective
– stronger (t2, t3); duplicated; and missed tests
– hopelessly inefficient – also debugging test suites!
JWalk: lazy systematic testing
– near-ideal coverage, adequate and effective
– a few input partitions missed (simple generation strategy)
– very efficient use of the tester’s time – sec. not min.
– or: two orders (x 1000) more tests, for same effort
Conclusion
Behavioural Response
– seems like a useful benchmark (scalable, flexible)
– use with formal, semi-formal, informal design methods
– measures effectiveness, rather than effort
Moral for testing
– don’t hype up automatic test (re-)execution
– need systematic test generation tools
– automate the parts that humans get wrong!
Any Questions?
http://www.dcs.shef.ac.uk/~ajhs/jwalk/
Put me to the test!