online testing of distributed systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfnice...

44
Online Testing of Distributed Systems Dejan Kostić EPFL, Switzerland Networked Systems Laboratory

Upload: others

Post on 27-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Online Testing of

Distributed Systems

Dejan Kostić

EPFL, Switzerland

Networked Systems Laboratory

Page 2: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Distributed Systems

2

Foundation of our society's infrastructure

• Enterprise storage, Web services, Internet

Page 3: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Internet Routing is Unreliable

• Operator mistakes (hijacks: Pakistan 2008, China 2010)

3

Pakistan ISP

Hijacked 10%

of the Internet

(Dell, CNN,

Amazon.de, ...)

Popular site inaccessible for ~ 2 hours!

Page 4: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Router type A

Router type B

At 17:07:26 UTC on August 19, 2009 CNCI

(AS9354), a small network service provider in

Nagoya, Japan, advertised a handful of BGP

updates containing an empty AS4_PATH

attribute. [renesys blog]

? Reset session!

0-length AS4_PATH

attribute!

Protocol-compliant,

confusing message

Software Bugs in Inter-Domain Routers

Page 5: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

…what could possibly go wrong?

5

10x

increase

routing

instabilities

[renesys blog]

Page 6: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Unaffected router

Affected router

?

?

?

?

?

Un

rea

cha

ble

!

Repeated service disruptions

What went wrong (Cisco session reset flood)

Locally admissible action can

have harmful global effect

Local testing is insufficient

Page 7: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Vision

Harness the increases in computational power

and bandwidth to improve system reliability

7

Nu

mbe

r of

transis

tors

[http://www.nordicbroadbandblog.com/] Moore’s law

Page 8: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Detect Possible Faults and Remove Them!

8

B

C A

D

Time

Live

Detection (in parallel)

Page 9: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Key Insight: Online Testing

9

Time

Sp

ace

Page 10: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Outline

• CrystalBall

– Introduced live model checking to predict and

avoid inconsistencies in distributed systems

• DiCE

– Enabled systematic “what-if” exploration across

heterogeneous, federated distributed systems

• NICE

– Combined model checking and symbolic

execution to test OpenFlow applications

10

Page 11: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

N1 N2 N3

Model Checking DS to Predict the Future

11

Rst

Timer

Message Reset

N1 N2 N3

• State space exploration, e.g.,

MaceMC [Killian et al.,NSDI‘07]

– Run each enabled handler in a set of

nodes (live code)

– Check safety properties in each state

Exponential explosion

in the number of states!

State • Local actions

• Timers

• Node reset

• Network messages

Page 12: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

12

B

C A

D

C’

B’

A’

New concept: periodically collect consistent

snapshot and execute state space exploration

from current state

CrystalBall [NSDI ’09, TOCS ‘10]

Page 13: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

CrystalBall Benefits

13

a b c d

a) Classic model

checking

b) Replay-based/

live predicate

checking

c) CB: deep online

debugging

d) CB: execution

steering

State-of-the-art

in distributed

debugging

Page 14: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Consequence Prediction

14

• Goal: quickly judge consequences

of node actions

• State-of-the-art is too slow

– 30 min for 6 steps

Key idea:

– Remove previously explored

local node actions

• Network messages can

still be interleaved

– Factor of 100 speedup Already explored

this node’s local

actions

Page 15: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Inconsistencies Found

System Inconsistencies not found by MaceMC

RandTree 2

Chord 2

Bullet’ 3

15

• Previously, these systems were

– Debugged both in local- and wide-area settings,

– Model checked using MaceMC (RandTree, Chord)

} Detected

violations of

MaceMC safety

properties

Page 16: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Execution Steering (ES)

• Goal: increase resilience of deployed systems

Initial focus on generic runtime mechanisms

• Goal: prevent inconsistencies without

introducing new ones

– As difficult as guaranteeing that the entire

program is error-free

• Hard problem

16

Page 17: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Execution Steering (ES)

• Approach:

– Install an event filter to block offending action

– Use sound filters (events normally occurring)

• Break TCP connection/drop UDP message

– Run consequence prediction to check the effect

of the event filter on system

17

B C

Page 18: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

bug1 bug2

violations

avoided by im.safety check

avoided byexecutionsteering

Paxos Execution Steering

18

• Paxos: protocol

for achieving

consensus

•Stress test:

2x100 live runs

that expose

Paxos safety

violations due to

two injected

bugs

•One CPU core

for CrystalBall

< 5% False negatives

Small performance

impact (5%)

Page 19: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Challenges in Widely-Deployed Systems

• Heterogeneity

– Multiple implementations, versions, patches

Difficult/impossible to obtain source/binary code

• Federation

– Multiple administrative domains

Cannot obtain internal state/configuration

19

Cannot use CrystalBall to detect Internet routing faults!

Page 20: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

DiCE [USENIX ’11, TR ‘11]

Node

neighborhood

Node

process

DiCE

Automatically explore system behavior to detect faults

1. Create an isolated snapshot of a node’s neighborhood

2. Subject node to many relevant inputs to exercise its actions

3. For each input, check if the snapshot misbehaves

Page 21: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Goal: Isolate testing from production environment,

while observing “societal” boundaries

DiCE Snapshot

Local checkpoint

of current state

and configuration

BGP is a federated environment Each router keeps its local checkpoint Private state & config stays in the AS

To run one exploration

(“what-if” scenario):

lazily clone the snapshot

Page 22: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Exploration of Behavior

Clone of BGP process

DiCE

Error!

To systematically exercise node behavior: Use a path exploration engine

Is there an error?

1 2 3

Input

Page 23: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Concolic (CONCrete+symbOLIC) Execution [Godefroid05, Cadar06, Cadar08, Burnim08, Crameri09]

23

real input X = 20 X = 5 X = 0 real input,

for X > 0?

real input,

X > 0 &&

X <10?

X < 10

X > 0

yes

yes no

no

X > 0

X < 10

yes

yes no

no

X > 0

X < 10

yes

yes no

no

Challenges:

• Coverage

– Deal with exponential number of paths

• Large inputs

– Handle systems with long running times

• Short running time

– To allow for preventative measures

Automatically uncover test

inputs using code itself!

Page 24: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Live

(production)

Current,

live state

Path

exploration

engine Exploration

(Message)

Handler

Code

Input generation

Input

control Input

Managing Path Explosion and

Generating Inputs

Path constraints

Perturb

list

Fuzz

msg

Config

change

Explorer

Node

Page 25: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Check properties that capture desired behavior (e.g., safety)

• Example: Harmful Global Events (session resets)

• Safety properties can be automatically inferred [SRDS’11]

?

?

?

?

? ∑

DiCE

explorer

f()

f()

f()

f()

f()

f()

f()

f()

f()

1 BGP error

1 BGP error

1 BGP error

1 BGP error

1 BGP error

0

0

0

0

Unaffected router

Affected router

5 BGP

errors

Valid but

ambiguous

messages

Error count > threshold?

Report:

List of packets that cause

harmful global behavior

Detecting Faults

Page 26: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Integration Details

• Integrated DiCE with BGP in the BIRD router

– Open-source router, coded in C

– Fuzzed path attributes in BGP messages

– Marked symbolic inputs in BIRD (a few LoC)

• Integrated DiCE with MaraDNS

– Open-source DNS server, coded in C

– 74 LoC to integrate with the concolic engine

26

BGP UPDATE

Header

Withdrawn Routes

Path Attributes

Attribute Type | Length | Value

Network Layer Reachability

Information NLRI Length | Prefix

Page 27: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

BGP Evaluation

27

• Microbenchmark:

– Less than 5% performance impact during exploration

• on a single CPU core, while processing routing updates

• Emulated 27-router network

– Using ModelNet [OSDI ‘02]

• BGP properties that capture resilient faults

1. Harmful Global Events

2. Origin Misconfiguration

3. Policy Conflicts

Page 28: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

BGP Evaluation Results

– Detected

– session reset and

– origin misconfiguration

28

Explored all paths in the UPDATE handlers + across

the Internet-like testbed in ~4 min avg (11 min max)

Page 29: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

101

Host B Host A

Switch 2

Install rule;

forward packet

Packet

OpenFlow

program Controller

Match Actions Counters

Flow Table

Rule 1

Rule 2

Rule N

Switch 1

Default: forward

to controller

Presence of a logically centralized controller

does not eliminate bugs!

Page 30: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Bugs in OpenFlow Apps

OpenFlow

program

Host B Host A

Switch 2

Install

rule

Switch 1

Packet

Install rule

(delayed)

?

Time

Controller

Page 31: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

NICE [NSDI ‘12]

No bugs In Controller Execution

Network

model

Switch

1

Switch

2

Switch

3

Switch

4

Unmodified

OpenFlow

program 1. Values of packet

header fields

2. Orderings of

packet arrivals

and events

Exercise possible corner cases by systematically

exploring the state space via model checking

Subject unmodified Apps to a multitude of inputs:

Host A Host B

Page 32: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Building a State-Space Model

Transition system:

• Components

• Channels

• State

• Transitions

• System execution

Introduced simplified

switch model for 7x

size reduction

State

0

sen

d(p

kt 1

)

State

1

State

2

State

3

State

4

State

5 clien

t 1

Switch and network

state is large

Page 33: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Overcoming Scalability Challenges

Large # of

possible

packets

Large # of

possible

event orderings

Data-plane driven Complex network behavior

Enumerating

all inputs and

event orderings is

intractable Equivalence

classes of

packets

Domain-

specific search

heuristics

Page 34: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Combating Huge Space of Packets

Use Symbolic Execution

– Code itself reveals relevant packets

• 1 path = 1 eq. class =

1 packet to inject

Example equivalence classes:

1. Broadcast destination

2. Unknown unicast destination

3. Known unicast destination

dst

bcast?

Flood

packet

Install rule

and forward

packet

dst in

table?

yes

no

no

yes

Unreachable

from initial

state

Page 35: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

State

1

New relevant

packets:

[pkt2, pkt3]

Enable new

transitions: client1 / send(pkt2)

client1 / send(pkt3)

Symbolic

execution

of packet_in

handler

Controller

state

client1

discover packets

State

2

State

3

client1

send(pkt2)

State

4

discover packets transition:

State

0

client1

send (pkt1)

Controller

state

changes

Augmenting Model Checking with

Symbolic Execution

Page 36: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Combating Huge Space of Orderings

MC

+

SE

MC

FLOW-IR

NO-DELAY

UNUSUAL

OpenFlow-specific search strategies for up to

20x state-space reduction:

5x faster than state-of-the-art tools (e.g., JPF)

Page 37: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Specifying App Correctness

• Library of common properties

– No forwarding loops

– No black holes

– Direct paths (no unnecessary flooding)

– …

• Exposed API to define app-specific properties

– Python code snippets that check network state

Page 38: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Prototype & Experiences

• Built a NICE prototype in Python

• Tested 3 unmodified NOX OpenFlow Apps

– MAC-learning switch (pyswitch.py)

– Web Server Load Balancer [Wang et al., HotICE ’11]

– Energy-Efficient Traffic Engineering [CoNEXT ’11]

• NICE found 11 property violationsbugs

– Violations found in a few seconds (30 min max)

– 1-2 simple bugs (forgotten packet in ctrl)

– 3 insidious bugs due to subtle race conditions

Page 39: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Future Work

• Test with real OpenFlow switches

• Incorporate online safety checks into NICE

• Simplify development and deployment of

distributed systems

39

Page 40: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

More information about NICE

• NICE source code + technical report:

http://prophet.epfl.ch

40

Page 41: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

41

Related Work

• Model checking

– CMC [Musuvathi02], MaceMC [Killian07], MODIST [Yang09]

• Offline/online predicate checking

– Friday [Geels07], OverLog/P2 [Singh06], D3S [Liu08]

• Symbolic execution

– Exe [Cadar06], Klee [Cadar08], KleeNet [Sasnauskas10], S2E [Chipounov11]

• Model checking + Symbolic execution

– Java PathFinder [Visser03]

• Static analysis of network configurations

– RCC [Feamster05], FlowChecker [Al-Shaer10], AntEater [Mai11]

• Runtime mechanisms

– {Peer,Net}Review [Haeberlen07,09], Bouncer [Costa07],

Rx [Qin05], BTR [Keller09], ShadowConfig [Alimi09]

Page 42: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Conclusion

• Hard to make distributed systems reliable

– Even harder when they are heterogeneous and federated

• Our vision:

– At runtime, detect possible deviations from correct behavior

– Steer execution away from the predicted faults

– New way of achieving reliability in distributed systems

• The systems we built, CrystalBall/DiCE:

– Quickly detect important classes of faults, with low overhead

• NICE automates testing of unmodified OpenFlow apps

– Combines model checking with symbolic execution

– Uses domain-specific search strategies

42

Page 43: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Thanks!

CrystalBall: Maysam Yabandeh, Nikola Knežević, Viktor Kuncak

43

DiCE: Marco Canini, Daniele Venzano, Vojin Jovanović, Gautam Kumar,

Dejan Novaković, Boris Spasojević, Olivier Crameri

NICE: Marco Canini, Daniele Venzano, Peter Perešíni, Jennifer Rexford

Page 44: Online Testing of Distributed Systemsnetseminar.stanford.edu/past_seminars/seminars/01_19_12.pdfNICE [NSDI ‘12] No bugs In Controller Execution Network model Switch 1 Switch 2 Switch

Questions?

44