debugging concurrent software by context-bounded analysis shaz qadeer microsoft research joint work...
Post on 21-Dec-2015
242 views
TRANSCRIPT
Debugging Concurrent Software by Context-Bounded
AnalysisShaz Qadeer
Microsoft Research
Joint work with:•Jakob Rehof, Microsoft Research•Dinghao Wu, Princeton University
Concurrent software
•Operating systems, device drivers•Databases, web servers, browsers, GUIs, ...•Modern languages: C#, Java
Processor 1
Processor 2
Thread 1
Thread 2
Thread 3
Thread 4
Concurrency is increasingly important
• Single-chip multiprocessors are an architectural inflexion point– Software running on these chips will be
even more concurrent
• Embedded systems– Airplanes, cars, PDAs, cellphones
• Web services
Reliable concurrent software?
•Correctness Problem– does program behave correctly for all
inputs and all interleavings?
•Bugs due to concurrency are insidious – nondeterministic, timing dependent– difficult to detect, reproduce, eliminate– coverage from testing very poor
Analysis of concurrent programs is difficult (1)
• Finite-data single-procedure program– n lines– m states for global data variables
• 1 thread– n * m states
• K threads– (n)
K * m states
Analysis of concurrent programs is difficult (2)
• Finite-data program with procedures– n lines– m states for global data variables
• 1 thread– Infinite number of states– Can still decide assertions in O(n * m3)– SLAM, ESP, BLAST implement this algorithm
• K 2 threads– Undecidable! (Ramalingam 00)
Context-bounded verification of concurrent software
Context Context Context
Context switch Context switch
Analyze all executions with small number of context switches !
• Many subtle concurrency errors are manifested in executions with a small number of contexts
• Context-bounded analysis can be performed efficiently
Why context-bounded analysis?
KISS: A static checker for concurrent software
• An implementation of context-bounded analysis– Technique to use any sequential checker
to perform context-bounded concurrency analysis
• Has found a number of concurrency errors in NT device drivers
Sequentialprogram QKISS
Sequential Checker
Concurrentprogram P
No error found
Error in Q indicateserror in P
KISS: A static checker for concurrent software
Sequentialprogram QKISS
Concurrentprogram P
KISS: A static checker for concurrent software
No error found
Error in Q indicateserror in P
SDV
Sequentialprogram QKISS
Concurrentprogram P
KISS: A static checker for concurrent software
No error found
Error in Q indicateserror in P
PREfix
Sequentialprogram QKISS
Concurrentprogram P
KISS: A static checker for concurrent software
No error found
Error in Q indicateserror in P
ESP
Inside a static checker for sequential programs
int x, y, z;
void foo ( ) { if (x > y) { y = x; } if (y > z) { z = y; }
assert (x ≤ z);}
• Symbolically analyze all paths
• Check the assertion for each path
• Interprocedural analysis – e.g., PREfix, ESP, SLAM,
BLAST
KISS strategy
• Q encodes executions of P with small number of context switches– instrumentation introduces lots of extra paths to
mimic context switches
• Leverage all-path analysis of sequential checkers
Sequentialprogram QKISS
Concurrentprogram P
SDV
PnpStop( ) { int t; de->stopping = T; t = AtomicDecr(& de->count); if (t == 0) SetEvent(& de->stopEvent); WaitEvent(& de->stopEvent);
}
DispatchRoutine( ) { int t; if (! de->stopping) { AtomicIncr(& de->count); // do useful work // … t = AtomicDecr(& de->count); if (t == 0) SetEvent(& de->stopEvent);
}}
DispatchRoutine( ) { int t; if (! de->stopping) { AtomicIncr(& de->count); // do useful work // … t = AtomicDecr(& de->count); if (t == 0) SetEvent(& de->stopEvent);
}}
PnpStop( ) { int t; if ($) return; de->stopping = T; if ($) return; t = AtomicDecr(& de->count); if ($) return; if (t == 0) SetEvent(& de->stopEvent); if ($) return; WaitEvent(& de->stopEvent);
}
PnpStop( ) { int t; if ($) return; de->stopping = T; if ($) return; t = AtomicDecr(& de->count); if ($) return; if (t == 0) SetEvent(& de->stopEvent); if ($) return; WaitEvent(& de->stopEvent);
}
DispatchRoutine( ) { int t; CODE; if (! de->stopping) { CODE; AtomicIncr(& de->count); // do useful work // … CODE; t = AtomicDecr(& de->count); CODE; if (t == 0) SetEvent(& de->stopEvent); CODE; }}
if ( !done ) { if ($) { done = T; PnpStop( ); }}
CODE bool done = F;
PnpStop( ) { int t; if ($) return; de->stopping = T; if ($) return; t = AtomicDecr(& de->count); if ($) return; if (t == 0) SetEvent(& de->stopEvent); if ($) return; WaitEvent(& de->stopEvent);
}
DispatchRoutine( ) { int t; CODE; if (! de->stopping) { CODE; AtomicIncr(& de->count); // do useful work // … CODE; t = AtomicDecr(& de->count); CODE; if (t == 0) SetEvent(& de->stopEvent); CODE; }}
if ( !done ) { if ($) { done = T; PnpStop( ); }}
CODE bool done = F;
main( ) { DispatchRoutine( ); }
PnpStop( ) { int t; CODE; de->stopping = T; CODE; t = AtomicDecr(& de->count); CODE; if (t == 0) SetEvent(& de->stopEvent); CODE; WaitEvent(& de->stopEvent); CODE;}
DispatchRoutine( ) { int t; if ($) return; if (! de->stopping) { if ($) return; AtomicIncr(& de->count); // do useful work // … if ($) return; t = AtomicDecr(& de->count); if ($) return; if (t == 0) SetEvent(& de->stopEvent);
}}
if ( !done ) { if ($) { done = T; PnpStop( ); }}
CODE bool done = F;
main( ) { PnpStop( ); }
KISS features• KISS trades off soundness for scalability • Cost of analyzing a concurrent program P =
cost of analyzing a sequential program Q– Size of Q asymptotically same as size of P
• Unsoundness is precisely quantifiable– for 2-thread program, explores all executions
with up to two context switches – for n-thread program, explores up to 2n-2
context switches
• Allows any sequential checker to analyze concurrency
Driver Stopping Error in Bluetooth Driver (1 KLOC)
DispatchRoutine() { int t; if (! de->stopping) { AtomicIncr(& de->count); assert ! driverStopped; // do useful work // … t = AtomicDecr(& de->count); if (t == 0) SetEvent(& de->stopEvent); }}
PnpStop() { int t; de->stopping = T; t = AtomicDecr(& de->count); if (t == 0) SetEvent(& de->stopEvent); WaitEvent(& de->stopEvent); driverStopped = T;}
int t;if (! de->stopping) {
int t;de->stopping = T;t = AtomicDecr(& de->count);if (t == 0) SetEvent(& de->stopEvent);WaitEvent(& de->stopEvent);driverStopped = T;
AtomicIncr(& de->count); assert ! driverStopped; // do useful work // … t = AtomicDecr(& de->count); if (t == 0) SetEvent(& de->stopEvent);}
Assertion fails!
DispatchRoutine(IRP *irp) { … irp->CancelRoutine = PacketCancelRoutine; Enqueue(irp); IoMarkIrpPending(irp); …}
IoCancelIrp(IRP *irp) { IoAcquireCancelSpinLock(); if (irp->CancelRoutine) { (irp->CancelRoutine)(irp); } …}
PacketCancelRoutine(IRP *irp) { … Dequeue(irp); IoCompleteRequest(irp); IoReleaseCancelSpinLock(); …}
IRP Cancellation Error in Packet Driver (2.5 KLOC)
…irp->CancelRoutine = PacketCancelRoutine;Enqueue(irp);
IoAcquireCancelSpinLock();if (irp->CancelRoutine) { // inline PacketCancelRoutine(irp) … Dequeue(irp); IoCompleteRequest(irp); IoReleaseCancelSpinLock();
IoMarkIrpPending(irp);
Error: An irp should not be marked pending after it has been completed !
Data-race Conditions in DDK Sample Drivers
• Device extension shared among threads• Data-races on device extension fields• 18 sample DDK drivers
– Range 0.5-9.2 KLOC– Total 70 KLOC
• Each field checked separately with resource limit of 20 minutes and 800MB
• Two threads: each calls nondeterministically chosen dispatch routine
Driver KLOC # Fields # Races
Tracedrv 0.5 3 0
Moufiltr 1.0 14 0
Kbfiltr 1.1 15 0
Imca 1.1 5 1
Startio 1.1 9 0
Toaster/toastmon 1.4 8 1
Diskperf 2.4 16 0
1394diag 2.7 18 0
1394vdev 2.8 18 1
Fakemodem 2.9 39 6
Toaster/bus 5.0 30 0
Serenum 5.9 41 2
Toaster/func 6.6 24 5
Mouclass 7.0 34 1
Kbdclass 7.4 36 1
Mouser 7.6 34 1
Fdc 9.2 92 9
Total:30 races
ToastMon_DispatchPnp(DEVICE_OBJECT *obj,IRP *irp)
{ … IoAcquireRemoveLock(); … case IRP_MN_QUERY_STOP_DEVICE: // Race: write access deviceExt->DevicePnPState = StopPending; … break; … IoReleaseRemoveLock(); …}
ToastMon_DispatchPower(DEVICE_OBJECT *obj,IRP *irp)
{ … // Race: read access if (deviceExt->DevicePnpState == Deleted) { … } …}
DevicePnpState Field in Toaster/toastmon
Acknowledgments
• Tom Ball• Byron Cook• John Henry• Doron Holan• Vladimir Levin• Jakob Lichtenberg• Adrian Oney• Sriram Rajamani• Peter Wieland• …
Keep It Simple and Sequential
• Context-bounded analysis by leveraging existing sequential checkers
• Validates the hypothesis that many concurrency errors require few context switches to show up
However…
• Hard limit on number of explored contexts– e.g., two context switches for concurrent
program with two threads
• Case study: Concurrent transaction management code written in C# (Naik-Rehof 04)– Analyzed by the Zing model checker after
automatically translating to the Zing input language
– Found three bugs each requiring between three and four context switches
Is a tuning knob possible?
Given a concurrent boolean program P and a positive integer c, does P go wrong by failing an assertion via anexecution with at most c contexts?
Given a concurrent boolean program P, does P go wrong by failing an assertion? Undecidable
Decidable
Given a concurrent boolean program P with unbounded fork-join parallelism and a positive integer c, does P go wrong by failing an assertion via an execution with at most c contexts? Decidable
Context Context Context
Context switch Context switch
Problem:• Unbounded computation possible within each context!• Unbounded execution depth and reachable state space• Different from bounded-depth model checking
Global store g, valuation to global variablesLocal store l, valuation to local variables Stack s, sequence of local storesState (g, s)
Sequential boolean program
Example
bool a = F;
void main( ) {L1: a = T;L2: flip(a);L3: }
void flip(bool x) {L4: a = !x;L5: }
(F, )
(F, _, L3)
(F, _, L3 T, L5)
(T, _, L3 T, L4)
(T, _, L2)
(F, _, L1)
(a, x, pc)
Global store g, valuation to global variablesLocal store l, valuation to local variables Stack s, sequence of local storesState (g, s)
Sequential boolean program
Transition relation:
(g, s) (g’, s’)
Reachability problem for sequential boolean program
Given (g, s), is there s’ such that (g, s) * (error,s’)?
Aggregate state
Set of stacks ssAggregate state (g, ss) = { (g,s) | s ss }
Reach(g, ss) = { (g’, s’) | exists s ss such that (g, s) * (g’, s’) }
Aggregate transition relation
Observations: • There is a unique smallest partition of Reach(g, ss) into aggregate states: (g’1, ss’1) … (g’n, ss’n) • The number of elements in the partition is bounded by the number of global stores
(g, ss) (g’1, ss’1)...
(g, ss) (g’n, ss’n)
Theorem (Buchi, Schwoon00)
• If ss is regular and (g, ss) (g’, ss’), then ss’ is regular.
• If ss is given as a finite automaton A, then a finite automaton A’ for ss’ can be constructed from A in polynomial time.
Algorithm
Solution:Compute automaton for ss’ such that (g, {s}) (error, ss’) and check if ss’ is nonempty.
Problem:Given (g, s), is there s’ such that (g, s) * (error,s’)?
Global store g, valuation to global variablesLocal store l, valuation to local variables Stack s, sequence of local storesState (g, s1, s2)
Concurrent boolean program
Transition relation:
(g, s1) (g’, s’1) in thread 1
(g, s1, s2) 1 (g’, s’1, s2)
(g, s2) (g’, s’2) in thread 2
(g, s1, s2) 2 (g’, s1, s’2)
Reachability problem for concurrent boolean program
Given (g, s1, s2), are there s’1 and s’2 such that (g, s1, s2) reaches (error, s’1, s’2) via an execution with at most c contexts?
Aggregate transition relation
(g, ss1) (g’, ss’1) in thread 1
(g, ss1, ss2) 1 (g’, ss’1, ss2)
(g, ss1, ss2) 2 (g’, ss1, ss’2)
(g, ss2) (g’, ss’2) in thread 2
(g, ss1, ss2) = { (g, s1, s2) | s1 ss1, s2 ss2 }
Algorithm: 2 threads, c contexts
1 2
1 2
1 2Depth c
(g, {s1}, {s2})
Compute the set of reachable aggregate states.Report an error if (g, ss1, ss2) is reachable andg = error, ss1 is nonempty, and ss2 is nonempty.
Complexity: 2 threads, c contexts
1 2
1 2
1 2
Depth of tree = context bound cBranching factor bounded by G 2 (G = # of global stores)Number of edges bounded by (G 2) (c+1)
Each edge computable in polynomial time
Depth c
(g, {s1}, {s2})
Context-bounded analysis of concurrent software
• Many subtle concurrency errors are manifested in executions with few context switches – Experience with KISS on Windows drivers– Experience with Zing on transaction manager
• Algorithms for context-bounded analysis are more efficient than those for unbounded analysis– Reducibility to sequential checking with KISS– Decidability of assertion checking for
concurrent boolean programs
Applications of context-bounded analysis
• Coverage metric for testing concurrent software
• Analysis of computer protocols– networking– cache-coherence
Unbounded fork-join parallelism
• Fork operation: x = fork• Join operation: join(x)• Copy thread identifier from one
variable to another
Algorithm: unbounded fork-join parallelism, c contexts
• At most c threads may perform a transition
• Reduce to previously solved problem with c threads and c contexts– Nondeterministically pick c forked
threads for execution
start : {1, …, c} boolean, initialized to i. (i == 1) end : {1, …, c} boolean, initialized to i. false
x = fork translates to
if ($) { assume(count < c); count = count + 1; x = count; start[count] = true;} else { x = c + 1;}
join(x) translates to
assume(x c);assume(end[x]);
count : {1, …, c}, initialized to 1
• c statically created threads• thread i starts execution when start[i] is true • thread i sets end[i] to true on termination