i know w hat y our packet did last hop: using packet histories to troubleshoot networks

34
I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks Nikhil Handigol With Brandon Heller, Vimal Jeyakumar, David Mazières, Nick McKeown NSDI 2014, Seattle, WA April 2, 2014

Upload: kimball

Post on 23-Feb-2016

22 views

Category:

Documents


0 download

DESCRIPTION

I Know W hat Y our Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks. Nikhil Handigol With Brandon Heller, Vimal Jeyakumar , David Mazières , Nick McKeown NSDI 2014, Seattle, WA April 2, 2014. Bug Story: Incomplete Handover. A. Match: Action - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

I Know What Your Packet Did Last Hop: Using Packet Histories

to Troubleshoot Networks

Nikhil HandigolWith

Brandon Heller, Vimal Jeyakumar, David Mazières, Nick McKeownNSDI 2014, Seattle, WA

April 2, 2014

Page 2: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

2

Bug Story: Incomplete HandoverA

B

Switch X

WiFi AP Y WiFi AP Z

Match: ActionSrc A, Dst B: Output to Y

Page 3: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

3

Network Outagesmake news headlines

Hosting.com's New Jersey data center was taken down on June 1, 2010, igniting a cloud outage and connectivity loss for nearly two hours… Hosting.com said the connectivity loss was due to a software bug in a Cisco switch that caused the switch to fail.

On April 26, 2010, NetSuite suffered a service outage that rendered its cloud-based applications inaccessible to customers worldwide for 30 minutes… NetSuite blamed a network issue for the downtime.

The Planet was rocked by a pair of network outages that knocked it off line for about 90 minutes on May 2, 2010. The outages caused disruptions for another 90 minutes the following morning.... Investigation found that the outage was caused by a fault in a router in one of the company's data centers.

Page 4: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

4

Troubleshooting Networks is Hard Today

ping

traceroute

tcpdump/SPAN/sFlowSNMP

Forwarding State

Forwarding State

Forwarding State

Forwarding State

Forwarding State

• Tedious and ad hoc• Requires skill and experience• Not guaranteed to provide helpful answers

Lots and lotsof graphs

(source: NANOG Survey in “Automatic Test Packet Generation”, Hongyi Zeng, et. al.)

Page 5: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

5

We want complete network visibility

ping

traceroute sFlow

SNMP

Complete visibility: every event that ever happened to every packet

0 100

We want tobe hereVisibility Spectrum

Page 6: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

6

Talk Outline

1. How to achieve complete network visibility– An abstraction: Packet History– A platform: NetSight

2. Why achieving complete visibility is feasible– Data compression– MapReduce-style scale-out design

Page 7: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

7

Forwarding State

Forwarding State

Forwarding State

Forwarding State

Packet History

Packet history = Path taken by a packet + Header

modifications +

Switch state encountered

Page 8: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

8

Our Troubleshooting Workflow

Forwarding State

Forwarding State

Forwarding State

Forwarding State

1. Record and store all packet histories2. Query and use packet histories of errant packets

Page 9: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

9

NetSightA platform to capture and filter

packet histories of interest

Page 10: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

10

PostcardCollector

Control Plane

Flow Table State Recorder

Page 11: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

11

PostcardCollector

Control Plane

Flow Table State RecorderMatch ACT

Match ACT

Page 12: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

12

PostcardCollector

Control Plane

Flow Table State Recorder

Page 13: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

13

PostcardCollector

Control Plane

Flow Table State Recorder

Version -> Flow Table State

Packet Header

Switch ID Output port

Version

Step 1: Generate postcards

Page 14: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

14

Reconstructing Packet HistoriesStep 2: Group postcards by generating packet

Packet Header

Switch ID Output port

Version

Packet Header

Switch ID Output port

Version

Packet Header

Switch ID Output port

Version

Packet Header

Switch ID Output port

Version

Page 15: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

15

Reconstructing Packet HistoriesStep 3: Sort postcards using topology

Topo-sort

Page 16: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

16

PostcardCollector

Control Plane

Flow Table State Recorder

1. <Match, Action>2. <Match, Action>3. <Match, Action>4. <Match, Action>5. <Match, Action> 6. …7. …

1. <Match, Action>2. <Match, Action>3. <Match, Action>4. <Match, Action>5. <Match, Action> 6. …7. …

1. <Match, Action>2. <Match, Action>3. <Match, Action>4. <Match, Action>5. <Match, Action> 6. …7. …

1. <Match, Action>2. <Match, Action>3. <Match, Action>4. <Match, Action>5. <Match, Action> 6. …7. …

Page 17: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

17

Control Plane

Flow Table State Recorder

PostcardCollector

NetSight APIPacket History Filter: A regular-expression-like language to specify packet histories of interest

• Reachability errors• Isolation violation• Black holes• Waypoint routing violation

Troubleshooting Apps

Postcards

Packet History Assembly

Troubleshooting Application

Troubleshooting Application

Troubleshooting Application

Troubleshooting App

FilteredPacket Histories

Page 18: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

18

Bug Story: Incomplete Handover

Packet History Filter

Packet History

WiFi AP Y WiFi AP Z

Switch X

Packet History Filter“Pkts from server not reaching the client”

Packet HistorySwitch X:inport: p0, outports: [p1] mods: [...] state version: 3

Switch Y:inport p1, outports: [p3]mods: ...…

Y

X

Page 19: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

19

Troubleshooting Apps

netsharknprof

ndb netwatch

ndb:Interactivenetwork debugger

netwatch:Live networkinvariant monitor

netshark:Network-widewireshark

nprof:Hierarchicalnetwork profiler

Page 20: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

20

But will it scale?

Page 21: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

21

Why generating postcards for every packet at every hop is crazy!

Network Overhead– 64 byte-postcard/pkt/hop– Stanford Network: 5 hops avg, 1031 byte avg pkt– 31% extra traffic!

Processing Overhead– Packet history assembly and filtering

Storage Overhead

Page 22: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

22

Why generating postcards for every packet at every hop is ^ crazy!

Cost is OK for low-utilization networks– E.g., test networks, “bring-up phase” networks– Single server can handle entire Stanford traffic

not

Page 23: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

23

Why generating postcards for every packet at every hop is ^ crazy!

Huge redundancy in packet header fields– Only a few fields change – IP ID, TCP seq. no.– Postcards can be compressed to 10-20 bytes/pkt

not

Diff-based compression

Page 24: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

24

Why generating postcards for every packet at every hop is ^ crazy!

Postcard processing is embarrassingly parallel– Each packet history can be processed independent

of other packet histories

not

Assembly Filtering

Page 25: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

25

Scaling NetSight Performance

Switch

Switch

Switch

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

Disk

Disk

Disk

… … … …

Postcards

Shuffle

CompressedPostcard Lists

CompressedPacket Histories

Page 26: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

26

Scaling NetSight Performance

Switch

Switch

Switch

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

Disk

Disk

Disk

… … … …

Postcards

Shuffle

CompressedPostcard Lists

CompressedPacket Histories

Page 27: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

27

Scaling NetSight Performance

Switch

Switch

Switch

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

Disk

Disk

Disk

… … … …

Postcards

Shuffle

CompressedPostcard Lists

CompressedPacket Histories

Page 28: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

28

Scaling NetSight Performance

Switch

Switch

Switch

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

NetSight Server

Disk

Disk

Disk

… … … …

Postcards

Shuffle

CompressedPostcard Lists

CompressedPacket Histories

Page 29: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

29

NetSight Variants

Page 30: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

30

NetSight-SwitchAssist moves postcard compression to switches

Switch

Switch

Switch

NetSight Server

NetSight Server

NetSight Server

Disk

Disk

Disk

… … …

Shuffle

Move postcard compression to switcheswith simple hardware mechanisms

Page 31: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

31

NetSight-HostAssist exploits visibility from the hypervisor

Switch

Switch

Switch

NetSight Server

NetSight Server

NetSight Server

Disk

Disk

Disk

… … …

Shuffle

HVPacketHeader

Mini-postcards contain only unique pkt ID and switch state version

(1) Store packet header at the hypervisor(2) Add unique pkt ID

Page 32: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

32

Overhead Reduction in NetSight

Basic (naïve) NetSight : 31% extra trafficin Stanford backbone network

NetSight Switch-Assist: 7%

NetSight Host-Assist: 3%

Page 33: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

33

Takeaways

Complete network visibility is possible– Packet History: a powerful troubleshooting

abstraction that gives complete visibility– NetSight: a platform to capture and filter packet

histories of interestComplete network visibility is feasible – It is possible to collect and filter packet histories at

scale

Page 34: I Know  W hat  Y our Packet Did Last Hop: Using Packet Histories  to Troubleshoot Networks

34

Every

NetSight API

http://yuba.stanford.edu/netsight