diagnosing missing events in distributed systems with negative provenance yang wu* mingchen zhao*...
TRANSCRIPT
Diagnosing Missing Events in Distributed Systems
with Negative Provenance
Yang Wu* Mingchen Zhao*
Andreas Haeberlen* Wenchao Zhou+ Boon Thau Loo*
* University of Pennsylvania + Georgetown University
1
Internet HTTP ServerData Center Network
Controller
Why is the HTTP servergetting DNS queries?
Motivation: Network debugging
DNSPacket
2
- Example: Software Defined Networks
HTTPPacket
- Need good debuggers
- SDN offers flexibility, but can have bugs
Why is the HTTP servergetting DNS queries?
- Existing tools: SNP (SOSP ‘11), NetSight (NSDI ‘14)
Approach: Provenance
Internet HTTP ServerData Center Network
Controller
DNSPacket
DNS Packet arrived at HTTP Server
DNS Packetreceived at Switch
Broken FlowEntry existed at Switch
… …
…
Program
DNSPacket
BrokenFlowEntry
3
- They produce “backtrace”, or provenance
Internet HTTP ServerData Center Network
Controller
- What if something expected is not happening?
Challenge: Missing events
???
4
Why is the HTTP server
NOT getting requests?
- Cannot be handled by existing tools- Reason: No starting point
Survey: How common are missing events?
5
17%
83%
Outages
48%
52%
NANOG-user
26%
74%
floodlight-dev
- Missing events are consistently in the majority
- Lengthier email threads for missing events
Missing events Positive events
NANOG-user Floodlight-dev Outages
Approach: Counter-factual reasoning
Find all the ways a missing event could have occurred,
Why did Bob NOT arrive at SIGCOMM?
6Philly Chicago
and show why each of them did not happen.
Result: debugger for missing events
Internet HTTP ServerData Center Network
Controller
No HTTP Packet arrived at HTTP Server
No Forwarding-FlowEntry installed at Switch
HTTP Packet received at Switch
Dropping-FlowEntry existed at Switch
… …
Program
… ??????HTTP
Packet
Dropping-FlowEntry
Why is the HTTP server
NOT getting requests?
7
Challenge: Too many possible explanations
8
Why did Bob NOT arrive at SIGCOMM?
Approach Generating Negative Provenance
Improving readability
Background: Provenance
9
Overview
Challenge: Too many explanations
Goal: Diagnose missing eventsWHY NOT ?
Approach: Counter-factual reasoning
SystemY!
R-tree indexing
Explanation usability
EvaluationQuery speed
Explanation size reduction
Experiments
Background: provenance- Captures causality between events
10
DNS Packet arrived at HTTP Server
DNS Packetreceived at Switch
Broken FlowEntry existed at Switch
… …
- Example: ExSPAN (SIGMOD ’10)
PacketSent :- PacketReceived, FlowEntry.
network datalog (NDLOG)
???
FlowEntry
PacketReceived
time t1 t2 t3 t4 t5 now
PacketSent PacketOut
PacketSent :- FlowEntry, PacketReceived.
PacketSent during [t4,t5]
11
Background: How to generate provenance?
PacketSent :- PacketOut.
t6
FlowEntry during [t4,t5] PacketReceived during [t4,t5]
Step 2: Admin issues query when bug occursStep 1: Collecting events from distributed systemStep 3: Provenance graph is generated
Approach Generating Negative Provenance
Improving readability
Background: Provenance
12
Overview
Challenge: Too many explanations
Goal: Diagnose missing eventsWHY NOT ?
Approach: Counter-factual reasoning
SystemY!
R-tree indexing
Explanation usability
EvaluationQuery speed
Explanation size reduction
Experiments
Generating negative provenance graphs
FlowEntry
PacketSent :- FlowEntry, PacketReceived.
PacketReceived
No PacketSent during [t1,now]
???
time t1 t2 t3 t4 t5 now
13
PacketSent
- Explaining why something does not exist- Use missing conditions to explain missing events
Generating negative provenance graphs- explanation can be unnecessarily complex
PacketSent
FlowEntry
PacketReceived
No PacketSent during [t1,now]
time t1 t2 t3 t4 t5 now
No PacketReceivedduring [t1,t2]
No FlowEntryduring [t2,t3] No PacketReceived
during [t3,t4]
No FlowEntry during [t4,t5]
No PacketReceived during [t5,now]
14
Generating negative provenance graphs- We want simple explanations
No PacketSent during [t1,now]
No FlowEntry during [t1,now]
15
PacketSent
FlowEntry
PacketReceived
time t1 t2 t3 t4 t5 now
- Key challenge: explanation can be complex
- This is hard (Set-Cover)
Generating negative provenance graphs- pseudo-code for building negative provenance graph
16
17
Challenge: explanation too complicated
No A atat X No B at
at Y No C atat Z
Approach Generating Negative Provenance
Improving readability
Background: Provenance
18
Overview
Challenge: Too many explanations
Goal: Diagnose missing eventsWHY NOT ?
Approach: Counter-factual reasoning
SystemY!
R-tree indexing
Explanation usability
EvaluationQuery speed
Explanation size reduction
Experiments
Readability: how to improve?
19
No chicken.
…
No egg.
No chicken.
… …
No Packet arrived at Server
No Packet arrived at S1
No Packet arrived at S2
No Packet arrived at S3
…
- Pattern: logical inconsistency. Solution: pruning.- Pattern: transient event chains. Solution: hiding.
Readability: improving techniques
Prune logical inconsistencies.Prune failed assertions.
Branch coalescing.Application-specific invariants.Hide transient event chains.
Summarize super-vertex.
20
21
Readability: concise explanations
Approach Generating Negative Provenance
Improving readability
Background: Provenance
22
SystemY!
R-tree indexing
Overview
Challenge: Too many explanations
Goal: Diagnose missing eventsWHY NOT ?
Approach: Counter-factual reasoning
Explanation usability
EvaluationQuery speed
Explanation size reduction
Experiments
System: Y!
23
Supports arbitrary NDLOG programs.
Efficient event storage: R-tree.
Supports general programs: Pyretic frontend.
System: better index for faster queries
Was there a FlowTable from 3pm to 8pm, whose priority is higher than 255?
Any hotels within 3 miles of SIGCOMM?
≈
24
System: R-tree for faster queries- R-tree: designed to handle high-dimensional queries
Used material from Wikipedia.
- basic idea: multi-dimensional B-tree
25
Approach Generating Negative Provenance
Improving readability
Background: Provenance
26
Overview
Challenge: Too many explanations
Goal: Diagnose missing eventsWHY NOT ?
Approach: Counter-factual reasoning
SystemY!
R-tree indexing
Explanation usability
EvaluationQuery speed
Explanation size reduction
Experiments
Evaluation: Experiments
27
Are negative provenance graphs concise?Are negative provenance graphs useful?
What is the query turnaround time?What is the runtime storage overhead?
Will Y! slow down the distributed system?How runtime storage overhead scales?
How query turnaround time scales?How readability heuristics scales?
Evaluation: How fast are the queries?
28
Query turnaround(seconds)
1.2
0.6
0.9
0.3
- Query turnaround less than one second
1 second
SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4
- Nine scenarios reproduced: SDN & BGP
Evaluation: Are answers concise?
29 SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4
Original Inconsistencies pruned
Final
- 90%
- Size reduced by over 90%
- Absolute size is always less than 25
25
# Verticesin answers
400
200
300
100
Evaluation: Are answers useful?Why is the HTTP server
NOT getting requests?
No HTTP Packet arrived at HTTP Server
No Forwarding FlowEntryarrived at Intermediate Switch
HTTP Packetsarrived at Border Switch
Forwarding FlowEntryarrived at Border Switch
Broken FlowEntryarrived at Intermediate Switch
Internet HTTP ServerData Center Network
Controller
???
BrokenFlowEntry
S1 S2???
HTTPPacket
30
- Goal: a debugger for missing eventsExample: Why HTTP server is not getting requests?
- Approach: Negative Provenance
Use counterfactual reasoning to explain how the missing events could have occurred.
- Explanation can be simplified.The initial explanation can be complex, but we can generate
concise answers.
- Implementation: Y!
Efficient queries due to a special index.
Usability and scalability evaluations.
Questions? 31
Summary