![Page 1: Toward Interactive Debugging for ISP Networks Chia-Chi Lin, Matthew Caesar, Jacobus Van der Merwe § University of Illinois at Urbana-Champaign § AT&T Labs](https://reader035.vdocuments.net/reader035/viewer/2022062618/55150671550346a80c8b565f/html5/thumbnails/1.jpg)
Toward Interactive Debugging for ISP Networks
Chia-Chi Lin†, Matthew Caesar†, Jacobus Van der Merwe§
†University of Illinois at Urbana-Champaign, §AT&T Labs – Research
Debugging in ISP Networks
• Internet: most complex distributed system ever created
  – Leads to complex failure modes
  – Bugs, vulnerabilities, compromise, misconfigurations
• Major challenges in debugging ISP networks
  – Lack of visibility
  – High rates of change of protocols
  – Complex interdependencies
• These could cause devastating effects
  – Long-term outages, slow repair
  – February 2009 BGP outage
Interactive Debugging is Necessary
• Problems exist with fully automated techniques
  – Focus on detection rather than diagnosis
  – Modeling could be inexact
  – Logical and semantic errors seem to require human knowledge to solve
• Our position:
  – Humans must be “in-the-loop”
  – Tools are required to facilitate the process
A Scenario
[Figure: an ISP network and its customer. Labels: “Pause when the outage occurs”, “Cloned Network”.]
Our Vision
• Isolation of the operational network
  – Prevent diagnostic procedure from interfering with live network operation
  – Solution: virtualization technologies
• Reproducibility of network execution
  – Enable operator to replay execution, narrow in on rare events
  – Solution: instill a pseudorandom ordering over events, messages
• Interactive stepping through execution
  – Operator can slowly step through operation, trace messages
  – Solution: protocols providing tight control over distributed execution
The Architecture
[Figure: architecture diagram. Components: Virtual Service Platforms; Virtual Service Coordinator; Debugging Coordinator; Virtual Service Nodes (Application 1: e.g. BGP; Application 2: e.g. OSPF); Physical Network Node; Physical Network Infrastructure; User (human troubleshooter).]
Key Challenge: Reproducibility
• Reproducibility simplifies interactive debugging
  – Can run multiple times, varying inputs to narrow down cause
  – When a rare bug occurs, don’t need to wait for it to reoccur
• One option: generate comprehensive logs of all events
  – e.g., log all packet sends/receives, all data
  – Problem: not scalable to large networked software
• Our approach: eliminate randomness in execution
  – Starting with the same initial state will produce the same execution
  – Make execution “pseudorandom” to explore different execution paths
  – Key challenge: how to eliminate randomness in large-scale software execution?
An Algorithm for Distributed, Reproducible Execution
• Approach:
  – Encapsulate software in a virtual environment
  – Intercept software’s inputs/outputs, instill an ordering over them
  – Make sure that ordering is the same every time the software is run
• How this is done:
  – Network is run in lockstep fashion
  – On every cycle: messages from neighbors are buffered
  – Before delivery to the application, a pseudorandom ordering is instilled by a consistent hash of each packet’s contents
  – Human sends “step” commands to move to the next lockstep cycle
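The lockstep steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `LockstepNode`, the helper `packet_key`, and the choice of SHA-256 as the consistent hash are all assumptions made for illustration. The point it shows is that sorting a cycle's buffered packets by a hash of their contents yields the same delivery order regardless of arrival order.

```python
import hashlib


def packet_key(packet):
    """Deterministic sort key: a hash of the packet's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(packet).hexdigest()


class LockstepNode:
    """Hypothetical node that buffers packets and delivers them once per cycle."""

    def __init__(self, name):
        self.name = name
        self.inbox = []      # packets buffered during the current cycle
        self.delivered = []  # packets handed to the application, in delivery order

    def receive(self, packet):
        # Buffer only; nothing reaches the application until the next step.
        self.inbox.append(packet)

    def step(self):
        """Run when the operator issues a "step" command: deliver this cycle's packets."""
        ordered = sorted(self.inbox, key=packet_key)
        self.delivered.extend(ordered)
        self.inbox = []
        return ordered
```

Two nodes that receive the same packets in different arrival orders return the same list from `step()`, which is the property that makes a run repeatable.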
Improving Performance for the Production Network
• Problem: running application in lockstep fashion slows operation
  – Might be okay for some protocols (e.g., BGP)
  – Probably not okay for others (e.g., OSPF)
• Solution: “optimistic” execution of events
  – Choose pseudorandom ordering in advance that is likely to happen anyway
  – Don’t buffer packets, deliver them immediately
  – If we guess wrong, roll back application to earlier state
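A minimal sketch of the optimistic idea, under stated assumptions: the ordering is a fixed ranking of senders chosen in advance, the "application" is a toy whose state is simply the history of inputs it has processed, and a checkpoint is kept before every delivery. The class `OptimisticNode` and its fields are hypothetical names, not taken from the slides.

```python
class OptimisticNode:
    """Deliver packets immediately; roll back when the agreed ordering is violated."""

    def __init__(self, order):
        self.rank = {sender: i for i, sender in enumerate(order)}
        self.log = []            # (rank, sender) pairs delivered so far, in rank order
        self.checkpoints = [()]  # checkpoints[i] is the app state before log[i] was applied
        self.rollbacks = 0

    @staticmethod
    def _apply(state, sender):
        # Toy application: the state is the tuple of senders processed so far.
        return state + (sender,)

    @property
    def state(self):
        return self.checkpoints[-1]

    def receive(self, sender):
        r = self.rank[sender]
        # Find where this packet belongs in the agreed ordering.
        pos = len(self.log)
        while pos > 0 and self.log[pos - 1][0] > r:
            pos -= 1
        # Anything already delivered after that position was a wrong guess.
        replay = [s for _, s in self.log[pos:]]
        if replay:
            self.rollbacks += 1
        # Roll back to the checkpoint taken before the first out-of-order delivery.
        del self.log[pos:]
        del self.checkpoints[pos + 1:]
        # Deliver the late packet, then re-deliver the rolled-back ones.
        for s in [sender] + replay:
            self.checkpoints.append(self._apply(self.state, s))
            self.log.append((self.rank[s], s))
```

When packets arrive in the predicted order, no rollback happens and delivery costs nothing extra; each wrong guess costs one rollback plus the replay of the undone packets.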
Example: Running the Lockstep Algorithm in a Cloned Network
[Figure: application nodes (App), each with a sending buffer and a receiving buffer, alternate between a Transmission Phase and a Processing Phase. Node messages: “I finished transmitting. I am ready to process.” and “I finished processing. I am ready to transmit.” Packets shown: A, K, L, S. Pseudorandom ordering shown: 1. S, 2. L, 3. K, 4. ……]
Example: Live Algorithm in Production Network
The live algorithm does two things:
• Determine the ordering of events
• Roll back events violating the ordering
[Figure: backbone topology with weighted links connecting Seattle, Los Angeles, Salt Lake City, Kansas City, Houston, Atlanta, New York, Washington, and Chicago. Pseudorandom ordering shown: 1. Seattle, 2. Los Angeles, 3. Kansas City, 4. Chicago, 5. …… Callouts: “Packets from Seattle should come before those from Los Angeles”; “Pseudorandom ordering is violated!”]
Connecting the Two Algorithms
• We can run the production network using the live algorithm
  – Achieves a fixed ordering over messages
  – But how to actually debug it?
• Solution: replay using the lockstep algorithm
  – First let the production network run, checkpoint starting state
  – To debug, start lockstep algorithm with same starting state
  – Lockstep algorithm will traverse the same execution
• Can replay multiple times, narrow in on problem, experiment by changing inputs, etc.
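A small sketch of checkpoint-and-replay, assuming a toy application whose state is the list of messages it has processed and a content hash (SHA-256, an assumption) as the fixed ordering. The function `replay` is a hypothetical helper; it is written as a generator so that each `next()` call plays the role of one operator “step” command.

```python
import copy
import hashlib


def content_key(message):
    """Fixed ordering key shared by the live run and every replay (assumed: SHA-256)."""
    return hashlib.sha256(message).digest()


def replay(checkpoint, cycles):
    """Yield the application state after each lockstep cycle, starting from a checkpoint."""
    state = copy.deepcopy(checkpoint)  # never mutate the saved checkpoint
    for messages in cycles:
        for message in sorted(messages, key=content_key):
            state.append(message)      # toy application: state is the input history
        yield list(state)              # pause point: one "step" command per cycle
```

Because the checkpoint is copied rather than mutated, the same starting state can be replayed any number of times, and replaying with a modified input list shows where the execution diverges.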
Simulation Settings
• Protocol evaluated: OSPF
• Topologies used: BRITE, Internet2 backbone
• Link delay model: 1 ms + (0, 0.5] exponentially distributed random delay
• Events simulated: Abilene IS-IS traces over the month of January 2009 (giving 209 events)
• Measure performance overheads of our approach
Results – Overhead in Production Networks
• Live algorithm suffers from rollbacks, incurring 4x inflation in traffic overhead
• Using delay-estimation optimization reduces overhead to 0.02x traffic inflation
Results – Response Time in Cloned Networks
• Low response time is beneficial to interactive debugging
• Response time is low for a variety of network sizes
Conclusion
• Humans are required to be “in-the-loop” to diagnose problems
• Our architecture is a first step towards interactive debugging
  – Builds on known techniques, e.g., virtualization technologies and distributed semaphores
  – Develops techniques to reproduce distributed executions
• Simulations on real-world events show that the scheme incurs low overheads
The State of the Art: Automated Techniques
• Logging observations
  – X-Trace, Friday, etc.
• Model checking
  – rcc, OD flow, etc.
• Debugging standalone programs
  – Coverity, AVIO, etc.
Optimized Ordering in the Production Network
• Goal: avoid rollbacks by selecting an ordering likely to happen anyway
  – Events separated by a long period will fall into different groups, which means ordering is easy
  – Problem: some failure events are correlated
    • E.g., multiple overlay links sharing the same physical link
  – How to order events in the same group?
• Solution: if we know link delays, we can reliably estimate expected arrival of events
  – In practice we don’t know exact link delays
  – But we can estimate them
  – Can improve estimation by giving protocol messages high priority
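One way to picture the delay-estimation idea is the sketch below. It rests on assumptions the slides do not state: links are symmetric, an event's notification travels along the lowest-delay path, and ties are broken by origin name. The functions `shortest_delays` and `predicted_order` and the example graph are hypothetical.

```python
import heapq


def shortest_delays(graph, source):
    """Dijkstra over estimated link delays; graph maps node -> {neighbor: delay}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def predicted_order(graph, receiver, events):
    """Order (origin, start_time) events by their estimated arrival time at receiver."""
    dist = shortest_delays(graph, receiver)  # symmetric links assumed
    return sorted(events, key=lambda e: (e[1] + dist[e[0]], e[0]))


# Hypothetical estimated delays in milliseconds between three nodes.
example_graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 2.0},
    "C": {"A": 4.0, "B": 2.0},
}
```

With `example_graph`, an event starting at C at time 0.0 is estimated to reach A at 3.0 (via B), while an event starting at B at time 1.0 is estimated to reach A at 2.0, so the predicted ordering at A puts the B event first even though it started later.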
Results – Storage in Production Network
• State required for rolling back packets is small and increases slowly with network size