pip: detecting the unexpected in distributed systems charles killian amin vahdat ucsd patrick...

34
Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds [email protected] du Collaborators: Janet Wiener Jeffrey Mogul Mehul Shah HP Labs http://issg.cs.duke.edu/pip/

Upload: ambrose-rice

Post on 28-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

Pip: Detecting the Unexpected in

Distributed Systems

Charles KillianAmin Vahdat

UCSD

Patrick [email protected]

u

Collaborators:Janet WienerJeffrey MogulMehul Shah

HP Labs

http://issg.cs.duke.edu/pip/

Page 2: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 2NSDI - May 8th, 2006

Introduction

• Distributed systems are essential to the Internet• Distributed systems are complex

– Harder to debug than centralized systems

• Pip uses causal paths to find bugs– Characterize system behavior– Deviations from expected behavior often indicate bugs

• Experimental results show that Pip is effective– Found 19 bugs in 6 test systems

Page 3: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 3NSDI - May 8th, 2006

Challenges of debugging distributed systems

• Distributed systems have more bugs– Parallelism is hard– New sources of failure

• More components = more faults

• Network errors

• Security breaches

• Finding bugs is harder– More nodes, events, and messages to keep track of– Applications may cross administrative domains– Bugs on one node may be caused by events on

another

Page 4: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 4NSDI - May 8th, 2006

Causal path analysis

• Programmers wish to examine and check system-wide behaviors– Causal paths– Components of end-to-end

delay– Attribution of resource

consumption• Unexpected behavior might indicate a bug– Pip compares actual behavior

to expected behavior

Web server

App server

Database

500ms

2000pagefaults

Page 5: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 5NSDI - May 8th, 2006

What kinds of bugs?

• Structure bugs– Incorrect placement/timing of processing

and communication

• Performance bugs– Throughput bottlenecks– Over- or under-consumption of resources

• Bugs present in an actual run of the system– Selecting input for path coverage is

orthogonal

Web server

App server

Database

500ms

2000pagefaults

Page 6: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 6NSDI - May 8th, 2006

Three target audiences

• Primary programmer– Debugging or optimizing his/her own system

• Secondary programmer– Inheriting a project or joining a programming

team– Discovery: learning how the system behaves

• Operator– Monitoring a running system for unexpected

behavior– Performing regression tests after a change

Page 7: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 7NSDI - May 8th, 2006

Outline

• Introduction• Pip

– Expressing expected behavior– Exploring application behavior– Results

• Conclusions

Page 8: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 8NSDI - May 8th, 2006

Workflow

Pip:1. Captures events from a running

system2. Reconstructs behavior from events3. Checks behavior against

expectations4. Displays unexpected behavior

Both structure and resource violations

Goal: help programmers locate and explain bugs

Behaviormodel

Application

Application

Expectations

Pip checker

Unexpectedstructure

Resourceviolations

Pip explorer: visualization GUI

Page 9: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 9NSDI - May 8th, 2006

Describing application behavior

• Application behavior consists of paths– All events, on any node, related to one high-level

operation– Definition of a path is programmer defined– Path is often causal, related to a user request

WWW

App srv

DB

Parse HTTP

Query

Send response

Run application

time

Page 10: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 10NSDI - May 8th, 2006

Describing application behavior

• Within paths are tasks, messages, and notices– Tasks: processing with start and end points– Messages: send and receive events for any

communication• Includes network, synchronization (lock/unlock), and

timers

– Notices: time-stamped strings; essentially log entries

WWW

App srv

DB

Parse HTTP

Query

Send response

Run application

time

“Request = /cgi/…” “2096 bytes in response”

“done with request 12”

Page 11: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 11NSDI - May 8th, 2006

Outline

• Introduction• Pip

– Expressing expected behavior– Exploring application behavior– Results

• Conclusions

Page 12: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 12NSDI - May 8th, 2006

Expectations

• Pip has two kinds of expectations

• Recognizers classify paths into sets

• Aggregates assert properties of sets

R1

R2

R3

?

Page 13: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 13NSDI - May 8th, 2006

Expectations: recognizers

• Application behavior consists of paths• Each recognizer matches paths

– A path can match more than one recognizer• A recognizer can be a validator, an invalidator, or

neither• Any path matching zero validators or at least one

invalidator is unexpected behavior: bug?validator CGIRequest thread WebServer(*, 1) task(“Parse HTTP”) limit(CPU_TIME, 100ms); notice(m/Request URL: .*/); send(AppServer); recv(AppServer);

invalidator DatabaseError notice(m/Database error: .*/);

Page 14: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 14NSDI - May 8th, 2006

Expectations: recognizers language

• repeat: matches a ≤ n ≤ b copies of a block

• xor: matches any one of several blocks

• future: lets a block match now or later– done: forces the named block to match

repeat between 1 and 3 { … }

xor {branch: …branch: …

}

future F1 { … }…done(F1);

Page 15: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 15NSDI - May 8th, 2006

Expectations: aggregate expectations

• Recognizers categorize paths into sets• Aggregates make assertions about sets of paths

– Instances, unique instances, resource constraints– Simple math and set operators

assert(instances(CGIRequest) > 4);assert(max(CPU_TIME, CGIRequest) < 500ms);assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));

Page 16: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 16NSDI - May 8th, 2006

Automating expectations and annotations

• Expectations can be generated from behavior model– Create a recognizer for each

actual path– Eliminate repetition– Strike a balance between

over- and under-specification

• Annotations can be generated by middleware– As in several of our test

systems

Behaviormodel

Application

Application

Expectations

Pip checker

Annotations

Unexpectedbehavior

Page 17: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 17NSDI - May 8th, 2006

Exploring behavior

•Expectations checker generates lists of valid and invalid paths

•Explore both sets– Why did invalid paths occur?– Is any unexpected behavior misclassified as

valid?• Insufficiently constrained expectations•Pip may be unable to express all expectations

Page 18: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 18NSDI - May 8th, 2006

Timing and resource properties for one taskCausal view of path

Visualization: causal paths

Caused tasks, messages, and notices on that thread

Page 19: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 19NSDI - May 8th, 2006

Visualization: timelines

• View one path instance as a timeline– Different colors = different hosts– Each bar = task – Click a bar to see task details

Page 20: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 20NSDI - May 8th, 2006

Visualization: performance graphs

• Plot per-task or per-path resource metrics– Cumulative distribution (CDF)– Probability density (PDF)– Changing values over time

• Click on a point to see its value and the task/path represented

Time (s)

Dela

y (

ms)

Page 21: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 21NSDI - May 8th, 2006

Results

• We have applied Pip to several distributed systems:– FAB: distributed block store– SplitStream: DHT-based multicast protocol– Others: RanSub, Bullet, SWORD, Oracle of Bacon

• We have found unexpected behavior in each system

• We have fixed bugs in some systems… and used Pip to verify that the behavior was fixed

Page 22: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 22NSDI - May 8th, 2006

Results: SplitStreamDHT-based multicast protocol

13 bugs found, 12 fixed– 11 found using expectations, 2 found using GUI

• Structural bug: some nodes have up to 25 children when they should have at most 18– This bug was fixed and later reoccurred– Root cause #1: variable shadowing– Root cause #2: failed to register a callback

• How discovered: first in the explorer GUI, confirmed with automated checking

Page 23: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 23NSDI - May 8th, 2006

Results: FABDistributed block store

1 bug found, fixed– Four protocols checked: read, write, Paxos,

membership

• Performance bug: quorum operations call nodes in arbitrary order– Should overlap computation when possible

Page 24: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 24NSDI - May 8th, 2006

Results: FABDistributed block store

• For disk-bound workloads, call self last• For cache-bound workloads, call self second

– Actually, last in quorum

Page 25: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 25NSDI - May 8th, 2006

Conclusions

• Causal paths are a useful abstraction of distributed system behavior

• Expectations serve as a high-level description– Summary of inter-component behavior and timing– Regression test for structure and performance

• Finding unexpected behavior can help us find bugs– Both structure and performance bugs– Visualization can expose additional bugs

http://issg.cs.duke.edu/pip/

Page 26: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

NSDI - May 8th, 2006 26

Extra slides

• Related work• Pip results: RanSub• Pip vs. printf• Pip resource metrics• Building a behavior model• Annotations• Checking expectations• Visualization: communication graph

Page 27: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 27NSDI - May 8th, 2006

Related work

• Causal paths– Pinpoint [Chen, 2004]– Magpie [Barham, 2004]

• Black-box inferences– Finding stepping stones [Zhang, 1997]– Wavelets to debug network performance [Huang,

2001]• Expectations-checking systems

– PSpec [Perl, 1993]– Meta-level compilation [Engler, 2000]– Paradyn [Miller, 1995]

• Model checking– MaceMC [Killian, 2006]– VeriSoft [Godefroid, 2005]

Page 28: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 28NSDI - May 8th, 2006

Pip results: RanSub

2 bugs found, 1 fixed• Structural bug: during first round of communication, parent nodes send summary messages before hearing from all children– Root cause: uninitialized state variables

• Performance bug: linear increase in end-to-end delay for the first ~2 minutes– Suspected root cause: data structure listing all

discovered nodes

Page 29: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 29NSDI - May 8th, 2006

Pip vs. printf

• Both record interesting events to check off-line– Pip imposes structure and automates checking– Generalizes ad hoc approaches

Pip printfNesting, causal order UnstructuredTime, path, and thread No contextCPU and I/O data No resource informationAutomatic verification using declarative language

Verification with ad hoc grep or expect scripts

SQL queries “Queries” using Perl scripts

Automatic generation for some middleware

Manual placement

Page 30: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 30NSDI - May 8th, 2006

Pip resource metrics

• Real time• User time, system time

– CPU time = user + system– Busy time = CPU time / real time

• Major and minor page faults (paging and allocation)• Voluntary and involuntary context switches• Message size and latency• Number of messages sent• Causal depth of path• Number of threads, hosts in path

Page 31: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 31NSDI - May 8th, 2006

Building a behavior model

Sources of events:• Annotations in source code

– Programmer inserts statements manually• Annotations in middleware

– Middleware inserts annotations automatically– Faster and less error-prone

• Passive tracing or interposition– Easier, but less information

• Or any combination of the above

Model consists of paths constructed from events recorded by the running application

Page 32: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 32NSDI - May 8th, 2006

Annotations

• Set path ID• Start/end task• Send/receive message• Notice

WWW

App srv

DB

Parse HTTP

Query

Send response

Run application

time

“Request = /cgi/…” “2096 bytes in response”

“done with request 12”

Page 33: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 33NSDI - May 8th, 2006

Checking expectations

Traces

Categorized paths

Reconciliation

Events database

Paths

Path construction

Expectation checking

Application

Application

For each path P For each recognizer R Does R match P?Check each aggregate

Expectations

Match start/end task, send/receive message

Organize events into causal paths

Page 34: Pip: Detecting the Unexpected in Distributed Systems Charles Killian Amin Vahdat UCSD Patrick Reynolds reynolds@cs.duke.edu Collaborators: Janet Wiener

page 34NSDI - May 8th, 2006

Visualization: communication graph

• Graph view of all host-to-host network traffic– Nodes = hosts, edges = network links in use– Directed edge = unidirectional communication