Marco Canini, RIPE 62 1
Online Testing of BGP
Marco Canini
EPFL, Switzerland
Work supported by the European Research Council
Joint work with: Vojin Jovanović, Daniele Venzano, Gautam Kumar, Dejan Novaković, Boris Spasojević, Olivier Crameri, and Dejan Kostić
4/5/2011
Networked Systems Laboratory
Is it hard to crash the Internet?
• Software bugs in inter-domain routers
Router type A
Router type B
0-length AS4_PATH attribute!
Protocol-compliant, confusing message
At 17:07:26 UTC on August 19, 2009 CNCI (AS9354), a small network service provider in Nagoya, Japan, advertised a handful of BGP updates containing an empty AS4_PATH attribute. [renesys blog]
Reset session!
Is it hard to crash the Internet?
• What went wrong
[Figure: network of routers. Legend: unaffected router, affected router]
Unreachable!
Repeated service disruptions: routing instabilities!
BGP not always reliable
• Distributed system behavior
  – Aggregate result of interleaved actions of multiple routers
  – Federated, heterogeneous and failure-prone environment
• Difficult to reason about all corner cases or combinations of configurations
  – Unanticipated interactions, subtle differences in inter-operable implementations, system-wide conflicts, seemingly valid local fault handling
Agenda
• Our system for online testing
  – Disclaimer: still a research work!
  – Not going to be an immediate solution
  – Hope it will be a tool for this community
• Solicit feedback
  – Which faults would you look for?
  – What would convince you to deploy our system?
• … discussion
DiCE comes to the rescue
• Key idea: automatically explore system behavior to detect potential faults
  1. Create an isolated snapshot of a BGP neighborhood
  2. Subject a router’s BGP process to many inputs that systematically exercise router actions
  3. For each input, check if the snapshot misbehaves
BGP neighbors
BGP process
DiCE: an error in the snapshot is evidence of possible future behavior of the production system
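The three steps of the key idea can be sketched as a small loop. This is a minimal illustration and not DiCE's code: `handle_update`, `session_stays_up`, and `explore` are hypothetical names, the router state is a plain dictionary, and a deep copy stands in for the isolated snapshot.

```python
import copy

def handle_update(state, update):
    """Toy stand-in for a router's UPDATE handler (hypothetical)."""
    if update.get("as4_path") == []:
        state["session_up"] = False          # mimics a reset on a 0-length AS4_PATH
    else:
        state["routes"][update["prefix"]] = update.get("as4_path")
    return state

def session_stays_up(state):
    """Property: the BGP session must not be reset."""
    return state["session_up"]

def explore(live_state, inputs, handler, prop):
    """Run each input against a fresh copy of the state and collect violations."""
    faults = []
    for candidate in inputs:
        snapshot = copy.deepcopy(live_state)  # step 1: isolated snapshot
        handler(snapshot, candidate)          # step 2: exercise the handler
        if not prop(snapshot):                # step 3: check the property
            faults.append(candidate)
    return faults

live = {"session_up": True, "routes": {}}
candidates = [
    {"prefix": "192.0.2.0/24", "as4_path": [64500, 64501]},
    {"prefix": "198.51.100.0/24", "as4_path": []},
]
faults = explore(live, candidates, handle_update, session_stays_up)
```

Only the second candidate ends up in `faults`, and `live` is left untouched, which is the point of testing against a snapshot.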
BGP snapshot
• Isolate testing from production environment
  – Special IP prefix
  – Custom attribute
• Local checkpoint of current state and configuration
[Figure: BGP process (FIB, sockets, BGP peers) next to the cloned BGP process (sockets, BGP checkpoints)]
• BGP’s federated environment
  – Each router keeps its local checkpoint
  – Private state & config stays in the AS
  – ASes collaborate to detect potential faults
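A local checkpoint can be approximated with a process clone. The prototype slide later in this deck mentions `fork()`; the sketch below assumes a POSIX system and uses hypothetical names (`check_in_clone`, `toy_handler`). The child works on its own copy of the state and reports only a pass/fail exit status, so nothing it does reaches the parent.

```python
import os

def check_in_clone(state, exploratory_input, handler, prop):
    """Run handler in a forked copy of the process; True if prop holds there."""
    pid = os.fork()
    if pid == 0:
        # Child: a copy-on-write clone; changes here never reach the parent.
        try:
            handler(state, exploratory_input)
            ok = bool(prop(state))
        except Exception:
            ok = False
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)            # parent: wait for the verdict
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

def toy_handler(state, update):
    """Hypothetical handler: records the prefix, resets on a malformed update."""
    state["routes"].append(update["prefix"])
    if update.get("malformed"):
        state["session_up"] = False

def session_up(state):
    return state["session_up"]

state = {"session_up": True, "routes": []}
ok_plain = check_in_clone(state, {"prefix": "192.0.2.0/24"}, toy_handler, session_up)
ok_bad = check_in_clone(
    state, {"prefix": "192.0.2.0/24", "malformed": True}, toy_handler, session_up
)
```

After both calls the parent's `state` is still the original dictionary, while `ok_bad` records that the clone misbehaved.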
Exploration of behavior
Clone of BGP process
DiCE
Error!
Use a path exploration engine
Concolic (CONCrete + symbOLIC) execution systematically exercises code paths
Is there an error?
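The idea behind concolic execution can be shown on a toy handler: run it on a concrete input, record which branches were taken, then negate one recorded branch at a time and look for an input that follows the new path. The sketch below is an illustration under strong simplifications. A real engine tracks symbolic expressions and calls a constraint solver; here the branch predicates are named up front and the "solver" is a brute-force search over one byte. All names are hypothetical.

```python
PREDICATES = {
    "nonempty": lambda n: n > 0,
    "aligned": lambda n: n % 4 == 0,
}

def handler(n, trace):
    """Toy attribute handler; `trace` records (branch name, outcome) pairs."""
    def branch(name):
        taken = PREDICATES[name](n)
        trace.append((name, taken))
        return taken
    if branch("nonempty"):
        if branch("aligned"):
            return "accept"
        return "malformed"
    return "reset"                      # the 0-length corner case

def solve(constraints, domain=range(256)):
    """Brute-force stand-in for a solver: first value meeting every constraint."""
    for n in domain:
        if all(PREDICATES[name](n) == want for name, want in constraints):
            return n
    return None

def concolic(seed):
    """Explore handler paths by negating one recorded branch at a time."""
    seen, results, worklist = set(), {}, [seed]
    while worklist:
        n = worklist.pop()
        trace = []
        outcome = handler(n, trace)
        path = tuple(trace)
        if path in seen:
            continue
        seen.add(path)
        results[path] = (n, outcome)
        for i, (name, taken) in enumerate(trace):
            flipped = trace[:i] + [(name, not taken)]
            candidate = solve(flipped)
            if candidate is not None:
                worklist.append(candidate)
    return results

results = concolic(seed=8)
outcomes = {outcome for _, outcome in results.values()}
```

Starting from the unremarkable input 8, the search reaches all three paths, including the input 0 that triggers the reset branch.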
Driving behavior by inputs
Code & current config
Path exploration engine
Messages
Failures
Configuration changes
Random choices
Timeouts
Input generation
Inputs
Path constraints
UPDATE
Header
Withdrawn Routes
Path Attributes
Attribute Type | Length | Value
Network Layer Reachability Information
NLRI: Length | Prefix
Symbolic inputs
Route selection
Route ranking: is most preferred route?
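To make the UPDATE layout concrete, the sketch below assembles a message by hand with the fields listed on the slide, following the wire format of RFC 4271 and the AS4_PATH type code from RFC 6793. It is an illustration only: the helper names are hypothetical, the mandatory ORIGIN, AS_PATH and NEXT_HOP attributes are left out for brevity, and DiCE derives such inputs from path constraints rather than building them by hand.

```python
import ipaddress
import struct

AS4_PATH = 17                 # attribute type code (RFC 6793)
OPTIONAL_TRANSITIVE = 0xC0    # attribute flags: optional + transitive

def path_attribute(flags, type_code, value):
    """Encode one path attribute as flags | type | length | value."""
    return struct.pack("!BBB", flags, type_code, len(value)) + value

def nlri(prefix):
    """Encode a prefix as length-in-bits followed by the minimum number of octets."""
    net = ipaddress.ip_network(prefix)
    octets = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:octets]

def update(attributes, prefixes):
    """Assemble header | withdrawn routes | path attributes | NLRI."""
    attrs = b"".join(attributes)
    body = struct.pack("!H", 0)                       # no withdrawn routes
    body += struct.pack("!H", len(attrs)) + attrs     # total path attribute length
    body += b"".join(nlri(p) for p in prefixes)
    # 16-byte marker, 2-byte total length, 1-byte type (2 = UPDATE)
    header = b"\xff" * 16 + struct.pack("!HB", 19 + len(body), 2)
    return header + body

empty_as4_path = path_attribute(OPTIONAL_TRANSITIVE, AS4_PATH, b"")
message = update([empty_as4_path], ["192.0.2.0/24"])
```

The empty attribute is just three bytes (flags, type, a zero length), which is what makes the 0-length AS4_PATH case easy to emit and easy to overlook in a parser.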
Detecting faults
• Check properties that capture desired behavior
• Example: Harmful Global Events (session resets)
[Figure: each of nine routers evaluates a local check f() and reports to the DiCE controller, which sums the results (∑). Five affected routers report 1 BGP error each and four unaffected routers report 0, for a total of 5 BGP errors.]
Valid but ambiguous messages
Error count > threshold?
Log inputs that have harmful global behavior
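The aggregation on this slide amounts to summing per-router verdicts and comparing the total with a threshold. A minimal sketch, with hypothetical function names and an arbitrarily chosen threshold; the counts match those on the slide.

```python
def local_check(saw_session_reset):
    """f(): a router's local verdict, 1 if it observed a BGP error, else 0."""
    return 1 if saw_session_reset else 0

def is_harmful(reports, threshold):
    """Controller side: sum the local verdicts and compare with the threshold."""
    return sum(reports) > threshold

# Nine routers, five of which reset a session.
observations = [True] * 5 + [False] * 4
reports = [local_check(o) for o in observations]
total = sum(reports)
harmful = is_harmful(reports, threshold=3)   # threshold value is an assumption
```

An input for which `harmful` is true would be logged as having harmful global behavior.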
Other properties
• Policy-induced divergence
• Origin misconfiguration
  – Check: routing tables polluted in external ASes?
• Route leaks (hijacks) by customer or provider
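[Figure: provider P holds a route for prefix d with AS_PATH X Y Z. Customer C sends an UPDATE for prefix d with AS_PATH C, after which P's entry for prefix d has AS_PATH C.]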
List of prefixes that can leak
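One way to picture the route-leak check: for every prefix in the provider's table, ask whether an announcement from the customer would pass the provider's import filter and displace the current route. The sketch below is an assumption-laden illustration, not the check DiCE implements. The filter is modeled as a list of allowed networks, and "displace" is simplified to "shorter AS_PATH"; `leakable_prefixes` and the sample data are hypothetical.

```python
import ipaddress

def leakable_prefixes(rib, customer_as, allowed):
    """Prefixes the customer could take over by announcing them with AS_PATH [C]."""
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    leaks = []
    for prefix, path in rib.items():
        net = ipaddress.ip_network(prefix)
        accepted = any(net.subnet_of(a) for a in allowed_nets)   # import filter
        originated_elsewhere = path[-1] != customer_as
        shorter = len(path) > 1           # [C] has length 1 and would be preferred
        if accepted and originated_elsewhere and shorter:
            leaks.append(prefix)
    return leaks

# Provider table: prefix -> AS_PATH (private-use AS numbers, documentation prefixes)
rib = {
    "203.0.113.0/24": [64510, 64511, 64512],
    "198.51.100.0/24": [64500],
}
permissive = leakable_prefixes(rib, customer_as=64500, allowed=["0.0.0.0/0"])
strict = leakable_prefixes(rib, customer_as=64500, allowed=["198.51.100.0/24"])
```

With the overly permissive filter the first prefix can leak; with a filter restricted to the customer's own prefix the list is empty.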
Keeping confidential information
• Potential router behavior
  – Common code paths already exposed
  – Reverse engineering any easier than today?
• Private state or configuration
  – Information hiding through randomization
  – Avoid inputs driven by confidential data, so that it cannot leak
• Rate limit, refuse certain explorers
• Anonymous property checks
  – Secure multi-party computation: no need for a trusted 3rd party
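The slide does not say which multi-party protocol is meant. As an illustration of why no trusted third party is needed, here is additive secret sharing, a well-known building block for computing a sum without revealing the individual inputs. The function names and the sample verdicts are hypothetical, and the whole exchange is simulated in one process.

```python
import secrets

MODULUS = 2**61 - 1   # any modulus larger than the largest possible sum works

def share(value, parties):
    """Split `value` into `parties` random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Each party shares its value; only per-party subtotals are ever combined."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party j receives the j-th share from everyone and publishes only its subtotal.
    subtotals = [sum(row[j] for row in all_shares) % MODULUS for j in range(n)]
    return sum(subtotals) % MODULUS

error_flags = [1, 1, 0, 1, 0]        # hypothetical per-AS verdicts
total_errors = secure_sum(error_flags)
```

Each individual share is a uniformly random number, so no single party learns another's verdict, yet the published subtotals add up to the true error count.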
Implementation details
• Integrated DiCE in BIRD 1.1.7
  – Open source router, coded in C
• Concolic execution instruments code to track symbolic inputs
  – Instrumentation needed only for testing
  – Negligible impact on the production environment
Evaluation
• Multiple BIRD instances on a 48-core machine
• Properties checked
  – Harmful global events
  – Origin misconfiguration
  – Policy conflict
Evaluation topology [Haeberlen et al., NSDI ’09] + Annotations
• Loaded ~300k BGP prefixes
• Replayed 15-min trace
• Policy and filtering
• Installed in ModelNet network emulator [OSDI ’02]
  – 30 ms intra-AS
  – 5 ms inter-AS
  – 620 Mbps
[Figure: evaluation topology with nodes AS 1, AS 2, AS 3, AS 4, AS 5, AS 6, AS 165053, AS 8, AS 9, AS 10 and the rest of the Internet. Legend: customer-provider link, peering link, backup link, router that resets session due to 0-length AS4_PATH]
Micro benchmarks
• CPU overhead
• Metric: BGP updates per s
  – Stress test during RIB load
    • Baseline: 15.1, w/ exploration: 13.9, impact 8%
  – Realistic test during trace replay
    • Negligible impact
• Memory overhead
  – Cloned process has 37% overhead on avg
• Bandwidth
  – 8 Kbps avg for exploratory messaging
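The 8% CPU impact follows directly from the two throughput figures in the stress test. A quick check, using only the numbers on the slide:

```python
baseline = 15.1            # BGP updates per s without exploration
with_exploration = 13.9    # BGP updates per s with exploration running

impact = (baseline - with_exploration) / baseline
impact_percent = round(impact * 100)
```

The relative drop is just under 0.08, which rounds to the 8% reported.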
Results
• Avg: 243 s, 756 explorations
  – Max 670 s, 2002 explorations
  – Without ModelNet: avg 155 s
  – Detected session reset and origin misconfiguration
Explored all paths in the UPDATE handlers + across the Internet-like testbed in ~4 min avg (11 min max)
Deployment option 1
• Convince Cisco, Juniper, Huawei, etc. to integrate DiCE
Deployment option 2
• Deploy DiCE+BIRD in a server
  – Potentially run multiple router instances
  – Configure with the AS policy & BGP feed
  – Connect with DiCE servers in neighboring ASes
Incentives
• Common infrastructure
• ISP benefits as an exploration target
  – Knowing about its faults
• Upstream ISPs can incentivize customer ISPs to serve as an “explorer”
  – Fewer faults, lower operational costs
Conclusion
• We have an online testing system for BGP
• Are you interested in trying out our prototype?
• Do you have suggestions for properties to check?
  – Get in touch: [email protected]
• Thank you! Questions?
• More info in our papers
  – [LADIS ’10, USENIX ATC ’11]
Backup slides
My Research
• Improving the reliability of distributed systems
• Why?
  – Foundation of our society’s infrastructure
  – ... but it is difficult to make them reliable
    • Produce robust design and implementation
    • Deploy and operate reliably
• A prime example: BGP (inter-domain routing)
  – Fundamental service for Internet’s operation
  – Additional challenges: federation & heterogeneity
DiCE/BGP Prototype in Action
[Figure: interaction between Node 1 (explorer) and Node 2. Node 1 runs the ctrl component and the path exploration engine, which exchange constraints/inputs. Step labels:]
1: create snapshot
1’: fork()
1’: annotated message
1’’: fork()
1’’: ack
2: input constraints
2’: fork()/ run
2’’: connect
2’’’: fork()/ run
3: message
3’: ack
4: property check
4: check
Inputs produced by DiCE
[Figure: flowchart in two parts.
Input generation code: starting from the original input x.y.z.w/l, a chain of "Fuzz?" decisions leads to "Fuzz attr" steps and yields the inputs x.y.z.w/l (0-length AS4_PATH), x.y.z.w/l (fuzz), x.y.z.w/l, and a.b.c.d/l (leaked prefix).
Router update handling code: each input passes through "Import filter1?" and "Import filter2?" decisions that end in Apply update, Drop update, or Send update.]
Property 3: BGP Policy Conflicts
Checking convergence is hard [Varadhan et al., ’96, Griffin et al., ’00]
– Check: Dispute wheel?
  • Absence of: sufficient condition for robust convergence
[Timothy G. Griffin, Leiden Global Internet talk ’00]
[Figure: BAD GADGET II, nodes 0 to 4, with the permitted paths listed at each node:
node 1: 1 3 0, 1 0
node 2: 2 1 0, 2 0
node 3: 3 4 2 0, 3 0
node 4: 4 2 0, 4 3 0]
Nodes locally prefer not routing directly to 0
Cycle!
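Why this gadget cycles can be checked by brute force: enumerate every assignment of permitted paths to nodes and test whether any assignment is stable, meaning no node would switch to a more preferred path that is currently available. The sketch below does that for the paths listed on this slide, read as most preferred first. It is a plain stable-state search, not dispute wheel detection and not DiCE's method, and the function names are hypothetical.

```python
from itertools import product

# Permitted paths per node, most preferred first, as read from the slide.
PERMITTED = {
    1: [(1, 3, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
    3: [(3, 4, 2, 0), (3, 0)],
    4: [(4, 2, 0), (4, 3, 0)],
}

def best_choice(node, assignment):
    """Most preferred permitted path whose tail is the next hop's current path."""
    for path in PERMITTED[node]:
        tail = path[1:]
        if tail == (0,) or assignment.get(tail[0]) == tail:
            return path
    return None

def stable_assignments():
    """Enumerate all path assignments and keep those where no node wants to switch."""
    nodes = sorted(PERMITTED)
    options = [PERMITTED[n] + [None] for n in nodes]
    stable = []
    for choice in product(*options):
        assignment = dict(zip(nodes, choice))
        if all(best_choice(n, assignment) == assignment[n] for n in nodes):
            stable.append(assignment)
    return stable

solutions = stable_assignments()
```

The search finds no stable assignment: whichever path node 3 holds, following the preferences around nodes 1, 2 and 4 makes node 3 want the other one, so the routers keep switching.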
Dispute Wheel Detection with DiCE
• Use symbolic input to change policy
  – Can cause a dispute wheel in a single step
• Use global precedence metric to detect and resolve conflict [Ee et al., SIGCOMM ’07]
  – Metric invoked: dispute wheel (DW) in the cloned snapshot, hence a fault
[Figure: GOOD GADGET and BAD GADGET II, nodes 0 to 4, with permitted paths 1 3 0, 1 0, 2 1 0, 2 0, 4 2 0, 4 3 0, 3 4 2 0, 3 0]
Report:
List of policy changes that cause oscillations