D3S: Debug Deployed Distributed Systems
Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, Zheng
Zhang
Microsoft Research Asia, Tsinghua University, Fudan University, Shanghai Jiaotong University, MIT CSAIL
Debugging distributed systems is difficult
• Bugs are difficult to reproduce
  – Many machines executing concurrently
  – Machines may fail
  – Network may fail
Example: Distributed lock
• Distributed reader-writer locks
  – Lock modes: exclusive, shared
  – Invariant: only one client can hold a lock in the exclusive mode
• Debugging is difficult because the protocol is complex
  – For performance, clients cache locks
  – For failure tolerance, locks have a lease
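To make the invariant concrete, here is a minimal sketch (ordinary C++ with illustrative names, not code from D3S) of the check a debugger would want to run over all clients' lock states:

```cpp
#include <map>
#include <string>
#include <vector>

enum LockMode { SHARED, EXCLUSIVE };

struct LockState { int client; std::string lock; LockMode mode; };

// Returns the locks that violate the reader-writer invariant:
// a lock is in conflict if it has more than one exclusive holder,
// or one exclusive holder plus at least one shared holder.
std::vector<std::string> FindConflicts(const std::vector<LockState>& states) {
    std::map<std::string, int> exclusive, shared;
    for (const auto& s : states) {
        if (s.mode == EXCLUSIVE) exclusive[s.lock]++;
        else shared[s.lock]++;
    }
    std::vector<std::string> conflicts;
    for (const auto& [lock, ex] : exclusive) {
        if (ex > 1 || (ex == 1 && shared[lock] > 0))
            conflicts.push_back(lock);
    }
    return conflicts;
}
```

The hard part in a distributed system is not this logic, but getting a consistent view of all clients' states to feed it.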
State-of-the-art of runtime checking
Step 1: add logs

  void ClientNode::OnLockAcquired(…) {
    …
    print_log( m_NodeID, lock, mode );
  }
Step 2: Collect logs, align them into a globally consistent sequence
• Keep the partial order
Step 3: Write checking scripts
• Scan the logs to retrieve lock states
• Check the consistency of locks
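Step 2 can be sketched in a few lines: assuming each log entry carries a logical timestamp assigned when it was written, merging per-machine logs into one sequence consistent with the happens-before partial order is a sort (names here are illustrative, not from the paper):

```cpp
#include <algorithm>
#include <vector>

struct LogEntry {
    long long ts;   // logical (e.g. Lamport) timestamp at log time
    int node;       // tie-breaker: node id
    int lock;
    int mode;
};

// Merge logs from all machines into one sequence consistent with
// the happens-before partial order: logical timestamps respect
// causality, and the node id breaks ties deterministically.
std::vector<LogEntry> AlignLogs(std::vector<LogEntry> all) {
    std::sort(all.begin(), all.end(),
              [](const LogEntry& a, const LogEntry& b) {
                  return a.ts != b.ts ? a.ts < b.ts : a.node < b.node;
              });
    return all;
}
```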
Problems for large/deployed systems
• Too much manual effort
• Difficult to anticipate what needs logging
  – Too much information: slows the system down
  – Too little information: misses the problem
• Checking a large system is challenging
  – A central checker cannot keep up
  – Snapshots must be consistent
• Our focus: make runtime checking easier and feasible for deployed, large-scale systems
D3S approach
[Figure: application nodes expose their lock states to multiple checkers; the checkers evaluate the predicate "no conflicting locks" and report a violation when a conflict is found]
Our contributions/outline
• A simple language for writing distributed predicates
• Programmers can change what is being checked on-the-fly
• Failure tolerant consistent snapshot for predicate checking
• Evaluation with five real-world applications
Design goals
• Simplicity: a sequential style for writing predicates
• Parallelism: run in parallel on multiple checkers
• Correctness: check consistent states in spite of failures
• Solution
  – MapReduce model
  – Failure-tolerant consistent snapshots
Developers write a D3S predicate
  V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
  V1: V0 { ( conflict: LockID ) } as final
  after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
  after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)
  class MyChecker : vertex<V1> {
    virtual void Execute( const V0::Snapshot & snapshot ) {
      …  // invariant logic, written in sequential style
    }
    static int64 Mapping( const V0::tuple & t );  // guidance for partitioning
  };
Part 1: define the dataflow and types of states, and how states are retrieved
Part 2: define the logic and mapping function in each stage for predicates
D3S parallel predicate checker
[Figure: lock clients expose states individually — e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S), … — and the checkers reconstruct snapshots SN1, SN2, …; the exposed tuples are partitioned across checkers by the key LockID, so (C1, L1, E) and (C5, L1, S) go to one checker and (C2, L3, S) to another]
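The key-based routing above can be sketched as follows; the hash-based Mapping and the Partition helper are illustrative, not D3S's actual implementation:

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

struct Tuple { int client; std::string lock; int mode; };

// Hypothetical mapping function, in the spirit of MyChecker::Mapping:
// the partition key is the lock id, so every holder of a given lock
// lands on the same checker, which can then test the invariant locally.
size_t Mapping(const Tuple& t) { return std::hash<std::string>{}(t.lock); }

// Route each exposed tuple to one of n checkers by its key.
std::map<int, std::vector<Tuple>> Partition(const std::vector<Tuple>& states,
                                            int n) {
    std::map<int, std::vector<Tuple>> byChecker;
    for (const auto& t : states)
        byChecker[Mapping(t) % n].push_back(t);
    return byChecker;
}
```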
States and dataflow

  V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
  V1: V0 { ( conflict: LockID ) } as final
  after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
  after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)
[Dataflow: triggers in the app feed V0 (exposer), which produces a set of (C, L, M) tuples; the checking function in V1 (checker) consumes them and produces a set of conflicting locks as the final report]
Source code for the Boxwood client:

  class ClientNode {
    ClientID m_NodeID;
    void OnLockAcquired( LockID, LockMode );
    void OnLockReleased( LockID, LockMode );
  };
• Insert hooks into the app using binary rewriting at run time
• Triggered at function boundaries to expose app states
Checking functions
• Written in C++, reusing app types
• Execute(): runs for each snapshot
• Mapping(): guides the partitioning of snapshots
  class MyChecker : vertex<V1> {
    void Execute( const V0::Snapshot & SN ) {
      foreach (V0::Tuple t in SN) {
        if (t.mode == EXCLUSIVE) ex[t.lock]++;
        else sh[t.lock]++;
      }
      foreach (LockID L in ex) {
        if (ex[L] > 1 || (ex[L] == 1 && sh[L] > 0))
          output += V1::Tuple(L);
      }
    }
    int64 Mapping( const V0::Tuple & t ) { return t.lock; }
  };
Summary of checking language
• Predicate
  – Any property calculated from a finite number of consecutive state snapshots
• Highlights
  – Sequential programs (with mapping)
  – Reuse app types in the script and C++ code
  – Support for reducing the overhead (in the paper)
    • Incremental checking
    • Sampling in time or across snapshots
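Incremental checking, one of the overhead reductions mentioned above, can be sketched like this: rather than rescanning every snapshot, the checker keeps per-lock counters and updates them on each addtuple/deltuple event (class and method names are illustrative, not D3S's API):

```cpp
#include <map>
#include <set>
#include <string>

// Sketch of incremental checking: maintain per-lock exclusive/shared
// counts and a running conflict set, touching only the one lock that
// changed instead of the whole snapshot.
class IncrementalLockChecker {
    std::map<std::string, int> ex_, sh_;
    std::set<std::string> conflicts_;
    void Recheck(const std::string& lock) {
        bool bad = ex_[lock] > 1 || (ex_[lock] == 1 && sh_[lock] > 0);
        if (bad) conflicts_.insert(lock);
        else conflicts_.erase(lock);
    }
public:
    void AddTuple(const std::string& lock, bool exclusive) {
        (exclusive ? ex_ : sh_)[lock]++;
        Recheck(lock);
    }
    void DelTuple(const std::string& lock, bool exclusive) {
        (exclusive ? ex_ : sh_)[lock]--;
        Recheck(lock);
    }
    const std::set<std::string>& Conflicts() const { return conflicts_; }
};
```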
Constructing consistent snapshots
• Use Lamport clocks to totally order states
• Problem: how does the checker know whether it receives all necessary states for a snapshot?
• Solution: detect app node failures and use membership info to construct snapshots
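For reference, a minimal Lamport clock looks like this (the standard algorithm, not D3S-specific code): local events bump a counter, and a receiver advances its counter past any timestamp it receives, so timestamps respect causality.

```cpp
#include <algorithm>
#include <cstdint>

// Minimal Lamport clock. Attaching these timestamps to exposed states
// gives the checker a total order consistent with happens-before.
struct LamportClock {
    int64_t t = 0;
    int64_t Tick() { return ++t; }        // local event or message send
    int64_t Receive(int64_t msg_ts) {     // message arrival
        t = std::max(t, msg_ts) + 1;
        return t;
    }
};
```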
Constructing consistent snapshots
• Membership: external service or built-in heartbeats
  – A snapshot is correct as long as the membership is correct
• When no state is being exposed, an app node reports its timestamp periodically
[Figure: timeline with app nodes A and B and a checker. A exposes { (A, L0, S) } at ts=2 and { (A, L1, E) } at ts=16; B exposes { (B, L1, E) } at ts=6, { } at ts=10, reports ts=12 with no state, then fails. With membership M(2)={A,B}, the snapshot at ts=2 is incomplete (SB(2) unknown); likewise SA(6) at ts=6. Once both members have reported past ts=6, the checker reuses SA(6)=SA(2) and runs check(6); then SB(10)=SB(6) and check(10). After B's failure is detected, M(16)={A} and check(16) runs over A's state alone. Resulting snapshots: SA(2), SB(6), SA(10), SA(16).]
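The completeness rule the checker applies can be sketched as follows (types and names are illustrative): a snapshot at time t is checkable once every member of M(t) has reported a timestamp of at least t, with silent nodes assumed unchanged since their last exposed state.

```cpp
#include <map>
#include <string>

// Per-node bookkeeping at the checker.
struct NodeReport {
    long long last_ts = -1;      // highest timestamp heard from the node
    std::string last_state;      // most recent exposed state (reused if silent)
};

// A snapshot at time t is complete when no live member might still
// send a state with timestamp <= t, i.e. everyone has reported past t.
bool SnapshotComplete(const std::map<std::string, NodeReport>& members,
                      long long t) {
    for (const auto& [node, rep] : members)
        if (rep.last_ts < t) return false;   // still waiting on this node
    return true;
}
```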
Experimental method
• By debugging 5 real systems, we answer:
  – Can D3S help developers find bugs?
  – Are predicates simple to write?
  – Is the checking overhead acceptable?
• None of the apps are written by us!
Case study: Leader-election
• Predicate
  – There is at most one leader in each group of replicas
• Deployment
  – 8 machines (1 Gb Ethernet, 2 GHz Intel Xeon CPU, 4 GB memory)
  – Test scenario: database app with random I/O (40 MB/s per machine at peak time)
  – Randomly crash & restart processes
• Debugging
  – 3 checkers, partitioned by replica groups
  – Time to trigger the violation: several hours
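The predicate itself is straightforward; a sketch in plain C++ (illustrative types, not the actual D3S script) of the per-group leadership count:

```cpp
#include <map>
#include <vector>

// One exposed state per replica: which group it belongs to and
// whether it currently believes it is the leader.
struct ReplicaState { int group; int replica; bool is_leader; };

// Returns the groups that violate the at-most-one-leader invariant.
std::vector<int> GroupsWithMultipleLeaders(
        const std::vector<ReplicaState>& states) {
    std::map<int, int> leaders;
    for (const auto& s : states)
        if (s.is_leader) leaders[s.group]++;
    std::vector<int> bad;
    for (const auto& [group, n] : leaders)
        if (n > 1) bad.push_back(group);
    return bad;
}
```

Partitioning by replica group (the Mapping function returning the group id) lets the 3 checkers run this independently.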
Root cause of the bug
[Figure: a coordinator with a failure detector manages replica nodes; after a timeout, the failure detector promotes another replica while the old leader is still alive, so two nodes claim "leader!"; the checker catches the violation and reports the nodes involved and the sequence of related states and events]
• Root cause: the coordinator crashed and forgot its previous answer
• Fix: it must write its answer to disk synchronously!
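The fix amounts to making the answer durable before replying. A minimal sketch using POSIX fsync (the file path and function name are illustrative, not from the system under test):

```cpp
#include <cstdio>
#include <string>
#include <unistd.h>   // fsync, fileno (POSIX)

// Persist the coordinator's answer durably before replying, so a
// crash-restart cannot "forget" a previously granted leadership.
bool PersistAnswer(const std::string& path, const std::string& answer) {
    FILE* f = std::fopen(path.c_str(), "w");
    if (!f) return false;
    bool ok = std::fputs(answer.c_str(), f) >= 0;
    ok = (std::fflush(f) == 0) && ok;     // user buffer -> OS cache
    ok = (fsync(fileno(f)) == 0) && ok;   // OS cache -> disk, before replying
    return (std::fclose(f) == 0) && ok;
}
```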
Summary of results

Application | LoC | Predicates | LoP | Results
Data-center apps:
  PacificA (structured data storage) | 67,263 | membership consistency; leader election; consistency among replicas | 118 | 3 correctness bugs
  Paxos implementation | 6,993 | consistency in consensus outputs; leader election | 50 | 2 correctness bugs
  Web search engine | 26,036 | unbalanced response time of indexing servers | 81 | 1 performance problem
Wide-area apps:
  Chord (DHT) | 7,640 | aggregate key range coverage; conflicting key holders | 72 | tradeoff between availability & consistency
  BitTorrent client | 36,117 | health of neighbor set; distribution of downloaded pieces; peer contribution rank | 210 | 2 performance bugs; free riders
Performance overhead (stress test of PacificA)
[Chart: time to complete (seconds) vs. number of clients (2, 4, 6, 8, 10), each sending 10,000 requests, with and without checking; the overheads are 7.21%, 4.38%, 3.94%, 4.20%, and 7.24% respectively]
• Less than 8%, in most cases less than 4%
• I/O overhead < 0.5%
• Overhead is negligible in the other checked systems
Related work
• Log analysis – Magpie[OSDI’04], Pip[NSDI’06], X-Trace[NSDI’07]
• Predicate checking at replay time– WiDS Checker[NSDI’07], Friday[NSDI’07]
• P2-based online monitoring– P2-monitor[EuroSys’06]
• Model checking– MaceMC[NSDI’07], CMC[OSDI’04]
Conclusions
• Predicate checking is effective for debugging deployed & large-scale distributed systems
• D3S enables:
  – Changing what is monitored on-the-fly
  – Checking with multiple checkers
  – Specifying predicates in a sequential, centralized manner
Design goals
• An advanced predicate checker designed for deployment & large scale
• Deployment
  – Flexibility: change which states are checked on-the-fly
  – Low overhead
• Large scale
  – Distributed checking
  – Failure tolerance: continue to check correctly when
    • an app node fails
    • a checking machine fails
Case study: PacificA
• A BigTable-like distributed database
• Replica group management
  – Perfect failure detection on storage nodes
  – Group reconfiguration to handle node failures
• Primary-backup replication
  – Two-phase commit for consistent updates
  – Data reconciliation when re-joining a node
Case study: PacificA
• A number of invariants stem from the design
  – Group consistency: a single primary in every replica group
  – Data consistency: the same data for the same version number
  – Reliability: when committing, all replicas are already prepared
  – Correctness of reconciliation: after joining the group, the new node has up-to-date state
  – Etc.
• Specify the invariants as predicates, and check them
  – Multiple checkers are necessary
• Result: detected 3 correctness bugs caused by atomicity violations and incorrect failure handling
Bug in RSL (Paxos server in Cosmos)
• RSL
  – 1 primary, 4 secondaries
  – Two-phase commit
  – Leader election / failure detection
[Figure: five RSL nodes A–E exchanging prepare messages; the primary role moves between nodes while one node remains in the learning state]
Root cause of the "live-lock":
• The preparing node only re-sends requests to the nodes that have previously responded to it
• A node in the "learning" state never participates in prepare
• Result: D is stuck preparing for a long time
Verifier: detects the unstable node status
Lessons:
• A complete system is error-prone due to optimizations and supporting components
• Bugs are not always visible from the outside
• Always-on checking catches "hidden" bugs
Chord overlay
Perfect ring:
• No overlap, no holes
• Aggregate key coverage is 100%
[Chart: key range coverage ratio (0%–200%) over time (0–80,000 seconds), comparing 3 predecessors vs. 8 predecessors]
Consistency vs. availability: you cannot get both
• Globally measure the relevant factors
• See the tradeoff quantitatively for performance tuning
• Capable of checking detailed key coverage
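The aggregate key-coverage predicate can be sketched on a simplified integer ring (names and the ring model are illustrative): each Chord node claims the key range (predecessor, self], and summing the claimed lengths over all nodes gives the coverage ratio.

```cpp
#include <utility>
#include <vector>

// Coverage ratio over a ring of `ring` ids: 1.0 with no overlaps is a
// perfect ring; > 1.0 means overlapping claims (consistency at risk),
// < 1.0 means holes (availability at risk).
double CoverageRatio(const std::vector<std::pair<int, int>>& ranges,
                     int ring) {
    long long claimed = 0;
    for (auto [pred, self] : ranges) {
        int len = (self - pred + ring) % ring;   // length of (pred, self]
        claimed += (len == 0 ? ring : len);      // lone node owns the whole ring
    }
    return static_cast<double>(claimed) / ring;
}
```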
[Chart: number of hits of Chord nodes (0–4) vs. key serial (0–256), for 3 predecessors and 8 predecessors]