diagnosing missing events in distributed systems with negative provenance yang wu* mingchen zhao*...

31
Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University of Pennsylvania + Georgetown University 1

Upload: roland-edds

Post on 14-Dec-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Diagnosing Missing Events in Distributed Systems

with Negative Provenance

Yang Wu* Mingchen Zhao*

Andreas Haeberlen* Wenchao Zhou+ Boon Thau Loo*

* University of Pennsylvania + Georgetown University

1

Page 2: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Internet HTTP ServerData Center Network

Controller

Why is the HTTP servergetting DNS queries?

Motivation: Network debugging

DNSPacket

2

- Example: Software Defined Networks

HTTPPacket

- Need good debuggers

- SDN offers flexibility, but can have bugs

Page 3: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Why is the HTTP servergetting DNS queries?

- Existing tools: SNP (SOSP ‘11), NetSight (NSDI ‘14)

Approach: Provenance

Internet HTTP ServerData Center Network

Controller

DNSPacket

DNS Packet arrived at HTTP Server

DNS Packetreceived at Switch

Broken FlowEntry existed at Switch

… …

Program

DNSPacket

BrokenFlowEntry

3

- They produce “backtrace”, or provenance

Page 4: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Internet HTTP ServerData Center Network

Controller

- What if something expected is not happening?

Challenge: Missing events

???

4

Why is the HTTP server

NOT getting requests?

- Cannot be handled by existing tools- Reason: No starting point

Page 5: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Survey: How common are missing events?

5

17%

83%

Outages

48%

52%

NANOG-user

26%

74%

floodlight-dev

- Missing events are consistently in the majority

- Lengthier email threads for missing events

Missing events Positive events

NANOG-user Floodlight-dev Outages

Page 6: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Approach: Counter-factual reasoning

Find all the ways a missing event could have occurred,

Why did Bob NOT arrive at SIGCOMM?

6Philly Chicago

and show why each of them did not happen.

Page 7: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Result: debugger for missing events

Internet HTTP ServerData Center Network

Controller

No HTTP Packet arrived at HTTP Server

No Forwarding-FlowEntry installed at Switch

HTTP Packet received at Switch

Dropping-FlowEntry existed at Switch

… …

Program

… ??????HTTP

Packet

Dropping-FlowEntry

Why is the HTTP server

NOT getting requests?

7

Page 8: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Challenge: Too many possible explanations

8

Why did Bob NOT arrive at SIGCOMM?

Page 9: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Approach Generating Negative Provenance

Improving readability

Background: Provenance

9

Overview

Challenge: Too many explanations

Goal: Diagnose missing eventsWHY NOT ?

Approach: Counter-factual reasoning

SystemY!

R-tree indexing

Explanation usability

EvaluationQuery speed

Explanation size reduction

Experiments

Page 10: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Background: provenance- Captures causality between events

10

DNS Packet arrived at HTTP Server

DNS Packetreceived at Switch

Broken FlowEntry existed at Switch

… …

- Example: ExSPAN (SIGMOD ’10)

PacketSent :- PacketReceived, FlowEntry.

network datalog (NDLOG)

Page 11: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

???

FlowEntry

PacketReceived

time t1 t2 t3 t4 t5 now

PacketSent PacketOut

PacketSent :- FlowEntry, PacketReceived.

PacketSent during [t4,t5]

11

Background: How to generate provenance?

PacketSent :- PacketOut.

t6

FlowEntry during [t4,t5] PacketReceived during [t4,t5]

Step 2: Admin issues query when bug occursStep 1: Collecting events from distributed systemStep 3: Provenance graph is generated

Page 12: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Approach Generating Negative Provenance

Improving readability

Background: Provenance

12

Overview

Challenge: Too many explanations

Goal: Diagnose missing eventsWHY NOT ?

Approach: Counter-factual reasoning

SystemY!

R-tree indexing

Explanation usability

EvaluationQuery speed

Explanation size reduction

Experiments

Page 13: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Generating negative provenance graphs

FlowEntry

PacketSent :- FlowEntry, PacketReceived.

PacketReceived

No PacketSent during [t1,now]

???

time t1 t2 t3 t4 t5 now

13

PacketSent

- Explaining why something does not exist- Use missing conditions to explain missing events

Page 14: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Generating negative provenance graphs- explanation can be unnecessarily complex

PacketSent

FlowEntry

PacketReceived

No PacketSent during [t1,now]

time t1 t2 t3 t4 t5 now

No PacketReceivedduring [t1,t2]

No FlowEntryduring [t2,t3] No PacketReceived

during [t3,t4]

No FlowEntry during [t4,t5]

No PacketReceived during [t5,now]

14

Page 15: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Generating negative provenance graphs- We want simple explanations

No PacketSent during [t1,now]

No FlowEntry during [t1,now]

15

PacketSent

FlowEntry

PacketReceived

time t1 t2 t3 t4 t5 now

- Key challenge: explanation can be complex

- This is hard (Set-Cover)

Page 16: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Generating negative provenance graphs- pseudo-code for building negative provenance graph

16

Page 17: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

17

Challenge: explanation too complicated

No A atat X No B at

at Y No C atat Z

Page 18: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Approach Generating Negative Provenance

Improving readability

Background: Provenance

18

Overview

Challenge: Too many explanations

Goal: Diagnose missing eventsWHY NOT ?

Approach: Counter-factual reasoning

SystemY!

R-tree indexing

Explanation usability

EvaluationQuery speed

Explanation size reduction

Experiments

Page 19: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Readability: how to improve?

19

No chicken.

No egg.

No chicken.

… …

No Packet arrived at Server

No Packet arrived at S1

No Packet arrived at S2

No Packet arrived at S3

- Pattern: logical inconsistency. Solution: pruning.- Pattern: transient event chains. Solution: hiding.

Page 20: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Readability: improving techniques

Prune logical inconsistencies.Prune failed assertions.

Branch coalescing.Application-specific invariants.Hide transient event chains.

Summarize super-vertex.

20

Page 21: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

21

Readability: concise explanations

Page 22: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Approach Generating Negative Provenance

Improving readability

Background: Provenance

22

SystemY!

R-tree indexing

Overview

Challenge: Too many explanations

Goal: Diagnose missing eventsWHY NOT ?

Approach: Counter-factual reasoning

Explanation usability

EvaluationQuery speed

Explanation size reduction

Experiments

Page 23: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

System: Y!

23

Supports arbitrary NDLOG programs.

Efficient event storage: R-tree.

Supports general programs: Pyretic frontend.

Page 24: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

System: better index for faster queries

Was there a FlowTable from 3pm to 8pm, whose priority is higher than 255?

Any hotels within 3 miles of SIGCOMM?

24

Page 25: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

System: R-tree for faster queries- R-tree: designed to handle high-dimensional queries

Used material from Wikipedia.

- basic idea: multi-dimensional B-tree

25

Page 26: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Approach Generating Negative Provenance

Improving readability

Background: Provenance

26

Overview

Challenge: Too many explanations

Goal: Diagnose missing eventsWHY NOT ?

Approach: Counter-factual reasoning

SystemY!

R-tree indexing

Explanation usability

EvaluationQuery speed

Explanation size reduction

Experiments

Page 27: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Evaluation: Experiments

27

Are negative provenance graphs concise?Are negative provenance graphs useful?

What is the query turnaround time?What is the runtime storage overhead?

Will Y! slow down the distributed system?How runtime storage overhead scales?

How query turnaround time scales?How readability heuristics scales?

Page 28: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Evaluation: How fast are the queries?

28

Query turnaround(seconds)

1.2

0.6

0.9

0.3

- Query turnaround less than one second

1 second

SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4

- Nine scenarios reproduced: SDN & BGP

Page 29: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Evaluation: Are answers concise?

29 SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4

Original Inconsistencies pruned

Final

- 90%

- Size reduced by over 90%

- Absolute size is always less than 25

25

# Verticesin answers

400

200

300

100

Page 30: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

Evaluation: Are answers useful?Why is the HTTP server

NOT getting requests?

No HTTP Packet arrived at HTTP Server

No Forwarding FlowEntryarrived at Intermediate Switch

HTTP Packetsarrived at Border Switch

Forwarding FlowEntryarrived at Border Switch

Broken FlowEntryarrived at Intermediate Switch

Internet HTTP ServerData Center Network

Controller

???

BrokenFlowEntry

S1 S2???

HTTPPacket

30

Page 31: Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University

- Goal: a debugger for missing eventsExample: Why HTTP server is not getting requests?

- Approach: Negative Provenance

Use counterfactual reasoning to explain how the missing events could have occurred.

- Explanation can be simplified.The initial explanation can be complex, but we can generate

concise answers.

- Implementation: Y!

Efficient queries due to a special index.

Usability and scalability evaluations.

Questions? 31

Summary