TestRank: Eliminating Waste from
Test-Driven Development
Presented by Hagai Cibulski, Tel Aviv University
Advanced Software Tools Research Seminar 2010
Elevator Pitch
TDD bottleneck: repeated runs of an ever-growing test suite
Consequences: less productivity, casualness in following TDD, loss of quality
TestRank – finds appropriate tests to run after given code edits
Run a fraction of the tests in each cycle: eliminate waste + high bug detection rate
Agenda
In this talk we will:
learn Test-Driven Development (TDD) in two minutes
observe insights into the nature of TDD tests
identify the TDD bottleneck
define the Regression Test Selection (RTS) problem for the TDD context
review past work on RTS
see alternative Program Analysis techniques:
Dynamic PA
Natural Language PA
present TestRank – an RTS tool for TDD
Test-Driven Development
Agile software development methodology
Short development iterations
Pre-written test cases define functionality
Each iteration: code to pass that iteration's tests
Test-Driven Development Cycle
Repeat:
Add a test
Run tests and see the new one fail
Write some code
Run tests and see them succeed
Refactor code
Run tests and see them succeed
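The cycle above can be made concrete with a minimal sketch in plain Java; the IntStack class and test name are hypothetical, and a real project would use JUnit rather than a hand-rolled check:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical example of one TDD iteration for a tiny IntStack class.
public class TddCycleDemo {
    // Steps 1-2: the test is written first; it fails until IntStack is implemented.
    static boolean pushThenPopReturnsLastElement() {
        IntStack s = new IntStack();
        s.push(42);
        return s.pop() == 42;
    }

    // Step 3: write just enough production code to make the test pass.
    static class IntStack {
        private final Deque<Integer> items = new ArrayDeque<>();
        void push(int v) { items.push(v); }
        int pop() { return items.pop(); }
    }

    public static void main(String[] args) {
        // Step 4: run the test and see it succeed.
        System.out.println(pushThenPopReturnsLastElement() ? "PASS" : "FAIL");
    }
}
```

Step 5 (refactor) would then rework IntStack while this test keeps passing.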
TDD Tests - Observations
TDD tests define functionality
TDD code is highly factored
Therefore:
A single test may cross multiple units of code
A single code unit implements functionalities defined in multiple tests
Test Suite - Observations
Tests are added over time:
5 developers × 1 test a day × 240 days = 1200 tests
1200 tests × 200 ms = 4 minutes
Integrated into nightly/integration builds:
Committed changes are covered nightly/continuously
Integrated into the team developers' IDE:
Programmers can run isolated tests quickly
The Motivation: Early detection of software bugs
A developer edits a block of code
Using strict unit tests as a safety net:
Finding the unit test to run is straightforward (1-1 or 1-n)
Using TDD functional tests as a safety net:
Finding the tests to run? (n-n)
Where is the code-test correlation?
Must we run the entire test suite? Might not be cost-effective
Delay running the entire test suite? That delays the detection of software bugs
Bugs become harder to diagnose the further the symptom is removed from the cause
TestRank Problem Definition
Given:
P – program under test
T – test suite (assuming all tests in T pass on P)
Q – query about location (method) L
Find:
Ranking: t1, t2, …, tn
s.t. if a change in L causes test ti to fail, i is minimal
Application:
Select the top (e.g. 20%) of the ranking
Goal: Achieve 1−ε bug detection, s.t. ε is minimal
TestRank – Application
Rank the tests such that running the top 20% ranked tests will reveal a failing test with 80% probability
20% is our promise to the developer
80% is justified assuming eventually all tests will be run
Usually a new bug's first chance of being detected is when all the tests are run (typically on the nightly build)
Don't waste time reconciling which (or whose) coding changes are responsible for new bugs
The bugs never get checked into the master source
Related Work
Past Work on Test Suite Optimization
Test Selection:
Lower total cost by selecting an appropriate subset of the existing test suite, based on information about the program, the modified version, and the test suite
Usually conservative ("safe") analyses
Test Prioritization:
Schedule test cases in an order that increases the rate of fault detection (e.g. by decreasing coverage delta)
TestTube: a system for selective regression testing
Chen, Rosenblum and Vo, 1994
"Safe" RTS: identify all global entities that test t covers
Assumes a deterministic system
Coarse level of granularity – C functions
Instrumentation: t → {fi}
closure(f) = global vars, types, and macros used by f
A test case t in T is selected for retesting P' if:
diff(P,P') ∩ closure(trace(t)) ≠ Ø
Reduction of 50%+ in the number of test cases
Only in case of "feature functions"…
(Figure: "feature functions" layered above "core functions")
Nondeterministic version ("transitive closure" technique): 0% reduction for "core functions"
DejaVu: a safe, efficient regression test selection technique
Rothermel and Harrold, 1997
Conservative + granularity at the statement level
Improving precision: control-flow based
CFG for each procedure
Instrumentation: t → {e}
Simultaneous DFS on G, G' for each procedure and its modified version in P, P'
A test case t in T is selected for retesting P' if its execution trace contains a "dangerous" edge
A lot of work goes into calculating diff(P,P')
Might be too expensive to be used on large systems
Results: Two studies found average reduction of 44.4% and 95%
DejaVOO - Two Phase Technique
A comparative study [Bible, Rothermel & Rosenblum. 2001] found TestTube/DejaVu exhibit trade-off of efficiency versus precision
Analysis time + test-execution time ≈ constant
Scaling Regression Testing to Large Software Systems
Orso, Shi, and Harrold (DejaVu), 2004
JBoss = 1 MLOC
Efficient approach: selected too many tests
Precise approach: analysis took too much time
In each case: analysis + execution > naïve retest-all
Implementing a technique for Java programs that is “safe”, precise, and yet scales to large systems
Phase #1: Fast, high-level analysis to identify the parts of the system that may be affected by the changes
Phase #2: Low-level analysis of these parts to perform precise test selection
DejaVOO Results
Considerable increase in efficiency
Same precision
TestTube / DejaVu vs. TestRank
TestTube / DejaVu: testing phase – input is a version to be tested, after everyone checked in
TestRank: implementation phase (TDD) – look at a single block of code, before check-in
TestTube / DejaVu: conservative – TestTube: low precision; DejaVu: higher precision (still, "safety" is overrated)
TestRank: high precision is the goal (sometimes not reporting may be OK)
Standard RTS vs. RTS for TDD
Commercial/Free Tools
Google Testar: selective testing tool for Java
Works with JUnit; records coverage by instrumenting bytecode
Clover's "Test Optimization": a coverage tool with a new test optimization feature
Speeds up CI builds; leverages "per-test" coverage data for selective testing
JUnitMax by Kent Beck: a continuous test runner for Eclipse
Supports test prioritization to encourage fast failures: run short tests first; run recently failed (and newly written) tests first
JTestMe: another selective testing tool for Java
Uses AspectJ, method-level coverage
Infinitest: a continuous test runner for JUnit tests
Whenever you make a change, Infinitest runs tests for you. It selects tests intelligently and runs the ones you need.
Uses static analysis – will not work with dynamic/reflection-based invocations
CodePsychologist
Locating Regression Bugs. Nir, Tyszberowicz and Yehudai, 2007
Same problem in reverse:
Given a checkpoint C that failed, and source code S of the AUT, find the places (changes) in the code S that cause C to fail
System testing
UI level
Using script/manual
Checkpoint C defined at UI level
CodePsychologist – code lines affinity
Checkpoint:
Select "clerk 1" from the clerk tree (clerk number 2). Go to the next clerk. The next clerk is "clerk 3"
CodePsychologist – affinity problem
Which group is {red, flower, white, black, cloud} closer to?
Affinity({red, flower, white, black, cloud}, {rain, green, red, coat}) >
Affinity({red, flower, white, black, cloud}, {train, table, love})
CodePsychologist – words affinity
Taxonomy of words: a graph where each node represents a synonym set
WordNet: an electronic lexical database. 1998
WordNet-based semantic similarity measurement. Simpson & Dao, 2005
CodePsychologist – word groups affinity
TestRank Marketecture
(Diagram: P and T feed Dynamic & Static Analyses, which produce Correlation Scores & Locator; a query Q (file:line) goes through the Query Engine, which outputs a ranking t1, t2, t3, …)
TestRank - Preprocessing Phase
Pre-compute test/unit correlation during a run of the test suite by tracing the tests through the production code
AspectJ
Use coverage as a basic soundness filter
Collect dynamic metrics:
Control flow
Data flow
Look for natural-language clues in the source text
WordNet
TestRank – Online Phase
Use correlation data during code editing to expose to the developer a list of tests which might conflict with the block of code currently being edited
Sorted in descending order of correlation
Developers can run just the specific functional tests
Dynamic PA
Execution Count Predictor:
How many times was this method called during the execution stemming from test t?
Call Count Predictor:
How many distinct calls to this method during the execution stemming from test t?
Normalize to [0, 1]:
score = c / (c+1)
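The c/(c+1) normalization maps any raw count into a comparable score; a minimal sketch (class and method names are our own):

```java
// Maps a raw event count c >= 0 into [0, 1): zero events score 0,
// and the score grows monotonically with c while saturating toward 1.
public class CountNormalizer {
    static double score(int c) {
        if (c < 0) throw new IllegalArgumentException("count must be non-negative");
        return (double) c / (c + 1);
    }

    public static void main(String[] args) {
        System.out.println(score(0));  // 0.0
        System.out.println(score(1));  // 0.5
        System.out.println(score(9));  // 0.9
    }
}
```

This keeps heavily-exercised methods from dominating purely by count while preserving their ordering.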
More Dynamic PA – The Stack
Two Stack Frames Count Predictor:
How many distinct configurations of the calling frame and the frame before that on the call stack?
Stack Count Predictor:
How many distinct call stack configurations?
Stack Depth Sum Predictor:
Sum the inverse depth of the call stack at each execution of this method stemming from test t
Dynamic PA – Data Flow
Value Propagation Predictor:
Compare values of simple-typed arguments (and the return value) between those flowing out of the test and those reaching the method under test
Size of intersection between the two sets of values.
For each test/method pair, find the maximum intersection m.
score = m / (m+1)
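A minimal sketch of this predictor's scoring, assuming the traced values have already been collected into sets (the class name and representation are our own; the real tool gathers values via AspectJ tracing):

```java
import java.util.HashSet;
import java.util.Set;

// Value Propagation sketch: score the overlap m between values flowing
// out of a test and values reaching the method under test as m / (m + 1).
public class ValuePropagation {
    static double score(Set<Object> testValues, Set<Object> methodValues) {
        Set<Object> intersection = new HashSet<>(testValues);
        intersection.retainAll(methodValues);     // keep only shared values
        int m = intersection.size();
        return (double) m / (m + 1);
    }
}
```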
Natural Language PA
Adapted CodePsychologist Algorithm.
Coverage as a soundness filter (execution count > 0)
For each test/method pair, look for:
Similar methodName()
"Similar literals"
// Similar comments
Similar words extracted from meaningfulIdentifierNames
NL analysis
During tracing, build a SourceElementLocator:
(fileName, beginLine) → ElementInfo{signature, begin, end, WordGroup}
For each source file, extract words and literals and map them by line numbers.
Literals are whole identifiers, strings and numbers
Words are extracted from identifiers by naming conventions: assuming_namingConventions → {assuming, naming, conventions}
For each method:
include the comments before the method
collect the group of words and literals mapped to line numbers between the beginning and the end of the method
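A minimal sketch of the identifier-splitting step, handling underscores and camelCase as in the assuming_namingConventions example (helper names are our own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Splits an identifier into lowercase words by naming conventions:
// first on underscores, then on lowercase-to-uppercase camelCase boundaries.
public class IdentifierSplitter {
    static List<String> words(String identifier) {
        List<String> out = new ArrayList<>();
        for (String part : identifier.split("_")) {
            // Zero-width split between a lowercase letter/digit and an uppercase letter.
            for (String w : part.split("(?<=[a-z0-9])(?=[A-Z])")) {
                if (!w.isEmpty()) out.add(w.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}
```

For example, words("assuming_namingConventions") yields [assuming, naming, conventions].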
NLPA – Word Group Affinity
For each test/method pair (t, m), locate the two code elements and get two word groups wg(t), wg(m)
Calculate GrpAff(wg(t), wg(m)) using the adapted CodePsychologist algorithm:
separate words from literals, compute GrpAff for each type separately, and take the average affinity
Filter out the 15% most common words in the text
TF-IDF
Term Frequency × Inverse Document Frequency
Balances the relative frequency of a word in a particular method with its overall frequency
w occurs n_{w,p} times in method p, and there are a total of N_p terms in the method
w occurs in d_w methods, and there are a total of D methods in the traces
tfidf(w, p) = tf(w, p) × idf(w) = (n_{w,p} / N_p) × log(D / d_w)
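The formula can be computed directly from the four counts; a minimal sketch (class name is our own):

```java
// tf-idf for a word in a method:
//   nwp - occurrences of the word in the method
//   Np  - total terms in the method
//   dw  - number of methods containing the word
//   D   - total number of methods in the traces
public class TfIdf {
    static double tfidf(int nwp, int Np, int dw, int D) {
        double tf = (double) nwp / Np;          // relative frequency within the method
        double idf = Math.log((double) D / dw); // rarity across all methods
        return tf * idf;
    }
}
```

A word that occurs often in one method but rarely elsewhere gets a high weight; a word appearing in every method gets idf = log(1) = 0.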
NLPA – Weighted Group Affinity
AsyGrpAff'(A, B) = 1/n · Σ_{1≤i≤n} [ max{ WrdAff(a_i, b_j) | 1 ≤ j ≤ m } · tfidf2(a_i, A) · factor(a_i) ]
(*) factor: words appearing in the method name are given a ×10 weight
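A minimal sketch of this weighted sum, with WrdAff, the tf-idf weight, and the method-name test passed in as callbacks, since the real versions come from WordNet and the traces (all names here are our own; the ×10 method-name factor follows the footnote above):

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Predicate;

// Asymmetric weighted group affinity: for each word a_i of group A,
// take its best word affinity against group B, weighted by its tf-idf
// value and by x10 when a_i appears in the method name.
public class GroupAffinity {
    static double asyGrpAff(List<String> a, List<String> b,
                            BiFunction<String, String, Double> wrdAff,
                            Function<String, Double> tfidf,
                            Predicate<String> inMethodName) {
        if (a.isEmpty()) return 0.0;            // guard: empty group has no affinity
        double sum = 0;
        for (String ai : a) {
            double best = 0;
            for (String bj : b) best = Math.max(best, wrdAff.apply(ai, bj));
            double factor = inMethodName.test(ai) ? 10.0 : 1.0;
            sum += best * tfidf.apply(ai) * factor;
        }
        return sum / a.size();                  // average over the n words of A
    }
}
```

With exact-match word affinity, uniform tf-idf, and no method-name words, a group sharing one of two words with B scores 0.5.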
Synthetic Experiment – Code Base
Log4J
Apache’s open source logging project
33.3KLOC
8.4K statements
252 test methods
Used CoreTestSuite = 201 test methods
1,061 actual test/method pairs traced
Synthetic Experiment – Performance
CPU: Intel Core2-6320 1.86 GHz; RAM: 2 GB
Preprocessing:
Dynamic PA ≈ 6 sec
Natural Language PA ≈ another 12 sec
Creates two database files:
affinities.ser ~1.1 MB
testrank.ser ~2 MB
Query < 1 sec
Synthetic Experiment – Method
Identified "core methods" covered by 20–30 tests each
Manually mutated four methods in order to get tests failing
Got ten test failures:
getLoggerRepository {testTrigger, testIt}
setDateFormat {testSetDateFormatNull, testSetDateFormatNullString}
getRenderedMessage {testFormat, testFormatWithException, … 3 more}
getLogger {testIt}
LogManager.getLoggerRepository is covered by 30 tests
Planted bug: removed the "if" condition:
// if (repositorySelector == null) {
       repositorySelector = new DefaultRepositorySelector(new NOPLoggerRepository());
       guard = null;
       LogLog.error("LogMananger.repositorySelector was null likely due to error in class reloading.");
// }
   return repositorySelector.getLoggerRepository();
Actual:
Errors: SMTPAppenderTest.testTrigger
Failures: TelnetAppenderTest.testIt
Synthetic Experiment – Method (2)
Pairs (mi, ti) of mutated method and actual failing test
e.g. input file (descriptor of 3 such pairs):
actual_1.txt:
LogManager.java:174
LoggerRepository org.apache.log4j.LogManager.getLoggerRepository()
void org.apache.log4j.net.SMTPAppenderTest.testTrigger()
void org.apache.log4j.net.TelnetAppenderTest.testIt()
Synthetic Experiment – Method (3)
Reverted all mutations back to the original code
Ran TestRank preprocessing
For each pair (mi, ti), ran a query on mi and compared ti to each TestRank predictor's ranking (ti1, ti2, …, tim)
For predictor p, let the actual failed test's relative rank be RRp = j/m, where ij = i
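The relative rank RRp can be computed directly from a predictor's ranking; a minimal sketch (names are our own):

```java
import java.util.List;

// Relative rank of the actual failing test within a predictor's ranking:
// its 1-based position j divided by the ranking length m, so lower is better.
public class RelativeRank {
    static double relativeRank(List<String> ranking, String failingTest) {
        int j = ranking.indexOf(failingTest) + 1;  // 1-based position
        if (j == 0) throw new IllegalArgumentException("test not in ranking");
        return (double) j / ranking.size();
    }
}
```

For example, a failing test ranked 2nd out of 4 has relative rank 0.5.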
Synthetic Experiment – Predictors Results

Bug | Safe RTS | Execution Count | Call Count | 2 Stack Frames | Stack Count | Stack Depth Sum | Value Propagation | Affinity | Simple Average
#1  | 30       | 8               | 2          | 2              | 2           | 2               | 20                | 3        | 2
#1  |          | 13              | 8          | 8              | 8           | 16              | 7                 | 10       | 17
#2  | 22       | 5               | 5          | 5              | 5           | 4               | 8                 | 8        | 7
#2  |          | 5               | 5          | 5              | 5           | 4               | 3                 | 8        | 6
#3* | 21       | 11-16           | 4-14       | 5-15           | 6-15        | 11-16           | 3-13              | 1-4      | 6-14
#4  | 17       | 4               | 4          | 4              | 4           | 4               | 4                 | 3        | 4

Different heuristics predicted different failures
Best heuristics: Value Propagation, Affinity
Stack Depth Sum was very good on 4 experiments, and among the worst on the other 6
Worst heuristic: Execution Count
(*) Bug #3 caused five tests to fail
Synthetic Experiment – Predictors Statistics
Improvement vs. "Safe" RTS

Statistic | Execution Count | Call Count | 2 Stack Frames | Stack Count | Stack Depth Sum | Value Propagation | Affinity | Simple Average
Average   | 45.8%           | 28.8%      | 31.2%          | 32.6%       | 42.9%           | 30.6%             | 18.6%    | 37.9%
Median    | 43.3%           | 23.5%      | 23.8%          | 26.7%       | 52.4%           | 26.7%             | 10%      | 28.6%
80%       | 61.9%           | 28.6%      | 33.3%          | 38.1%       | 57.1%           | 38.1%             | 33.3%    | 56.7%
Affinity is best on Average and Median
Call Count is best on 80% percentile
Worst heuristics: Execution Count, Stack Depth Sum
Simple Average is a bad meta heuristic
Conclusions
TDD is powerful but over time introduces waste in the retesting phase
"Safe" RTS techniques are too conservative for TDD (and are not really safe…)
Our technique enables finding and running the tests most relevant to a given code change
Dynamic and natural-language analyses are key
Developers can run the relevant tests and avoid wasting time on the irrelevant ones
Eliminate waste from the TDD cycle while maintaining a high bug detection rate
Makes it easy to practice TDD rigorously
(Near) Future Work
We are currently working on:
Affinity propagation through the call tree
Meta heuristics:
Weighted average
Use experimental results as training data?
Self-weighting heuristics
Further validation
Future Work
Reinforcement learning:
Strengthen correlation for true positives and weaken it for false positives
Interactive confirm/deny
Add annotations/tagging:
Finer granularity, greater precision
String edit distance between literals
Consider external resources:
Changes in files such as XML and properties
Combine global ranking:
Test cyclomatic complexity / test code size
Use timing of tests for cost-effective ranking (short tests rank higher)
Selection should have good total coverage
Handle multiple edits
Integration with Eclipse and JUnit
Change filtering: comments, refactoring, dead code
Combine static analysis (combine existing tools)
Further Applications of Code/Test Correlation
Assist code comprehension: What does this code do?
Assist test maintenance: What is the sensitivity/impact of this code? Which tests to change?
Find regression to known past bugs: related bug descriptions in the bug tracking system
Reverse applications:
Find regression cause: test fails → where to fix (CodePsychologist++)
Find bug cause: find the relevant code for a bug description in the bug tracking system
TDD implementation assist: spec (test) change → where to implement
Questions?
Thank You
How Often?
Quote from JUnit FAQ:http://junit.sourceforge.net/doc/faq/faq.htm
How often should I run my tests?
Run all your unit tests as often as possible, ideally every time the code is changed. Make sure all your unit tests always run at 100%. Frequent testing gives you confidence that your changes didn't break anything and generally lowers the stress of programming in the dark.
For larger systems, you may just run specific test suites that are relevant to the code you're working on.
Run all your acceptance, integration, stress, and unit tests at least once per day (or night).
How much time?
We posted a question on stackoverflow.comhttp://stackoverflow.com/questions/1066415/how-much-time-do-you-spend-running-regression-tests-on-your-ide
How much time do you spend running regression tests on your IDE, i.e. before check-in?
In most cases these tests will run in < 10 seconds. To run the complete test suite I rely on the Hudson Continuous Integration server... (within an hour).
sometimes I run a battery of tests which takes an hour to finish, and is still far from providing complete coverage.
My current project has a suite of unit tests that take less than 6 seconds to run and a suite of system tests that take about a minute to run.
I would generally run all my tests once per day or so, as, in one job, I had about 1200 unit tests.
Assumptions
Baseline - All tests in T pass on P
Change is localized to a single method
We currently ignore some possible inputs:
Source control history
Test results history
Test durations
Recently failed/ added/changed tests
DejaVu Results
“Siemens study” [Hutchins 1994] Set of 7 small, nontrivial, real C programs
141-512 LOC, 8-21 procedures, 132 faulty versions, 1000-5500 tests
44.4% average reduction
"Player": a worker handling one player in the internet game 'Empire'
766 procedures, 50 KLOC, 5 versions, 1000 tests (same command with different parameters)
95% average reduction!