Post on 20-Dec-2015
Network Resilience: Exploring Cascading Failures
Vishal Misra
Columbia University in the City of New York
Joint work with Ed Coffman, Zihui Ge and Don Towsley (UMass Amherst)
Prologue
On Tuesday, September 18, simultaneous with the onset of the propagation phase of the Nimda worm, we observed a BGP storm.
This one came on faster, rode the trend higher, and then, just as mysteriously, turned itself off, though much more slowly. Over a period of roughly two hours, starting at about 13:00 GMT (9am EDT), aggregate BGP announcement rates exponentially ramped up by a factor of 25, from 400 per minute to 10,000 per minute, with sustained "gusts" to more than 200,000 per minute. The advertisement rate then decayed gradually over many days, reaching pre-Nimda levels by September 24th.
Similar events were observed on July 19th, the day CODE RED spread.
http://www.renesys.com/projects/bgp_instability
Conjecture
The viruses started random IP port scanning.
Most of these random IP addresses were not in the cached entries of the routing table, causing frequent cache misses and, in the case of invalid IP addresses, generation of ICMP (router error) messages.
Both of the above led to router CPU overload, causing routers to crash.
Router failure led to withdrawal announcements by the peers, generating a high level of advertisement traffic.
When a router came back on, it required a full state update from its peers, creating a large spike in the load of the peers that provided the state dump.
Once the restarted router obtained all the dumps, it dumped its full state to all its peers, creating another spike in the load.
Frequent full state dumps led to more CPU overload, leading to more crashes, and the propagation of the cycle...
Cascading Failures?
Outline
Background
Modeling interactions
A Fluid model
  Phase transitions
A Birth-Death model
  More phase transitions
Insights
Future work
Studies in Cascading Failures
Cascading failures studied extensively in Power Networks (Zaborsky et al.)
Coupling in Power Networks between nodes well understood: e.g. differential equations describe voltage-phasor-load relationships
Coupling in data networks: Routing, Traffic engineering, policy routing, DNS…difficult to model!
Modeling interactions
We model coupling at BGP level
Study the interaction of a clique of BGP routers
Model three different kinds of phenomena: router crash, router repair and full state updates
System essentially forms a mutual aid collective
Clique of routers
• Routers form a fully connected graph
• All routers are peers of each other
• At the AS level, BGP routers form a clique of the order of 540 nodes
A fluid model for interactions
We consider a clique of N nodes
Study process of nodes that are down, D
ks : Rate at which a single up node brings up down nodes
kl : Rate at which full state updates bring down up nodes
Typically, expect ks >> kl
Drift equations
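$\lambda(t)$ = Number of arrivals (up nodes going down) in $[0,t)$
$d\lambda(t) = (N-D)\, D\, k_l\, dt$
$\mu(t)$ = Number of departures (down nodes coming back up) in $[0,t)$
$d\mu(t) = D \cdot \frac{N-D}{D}\, k_s\, dt = (N-D)\, k_s\, dt$
Now, consider the drift in down nodes $D$:
$dD(t) = d\lambda(t) - d\mu(t)$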
Dynamics of D
$$\frac{dD}{dt} = -k_l D^2 + (k_l N + k_s)\, D - k_s N$$
System shows a phase transition:
If $D(0) > k_s / k_l$, then $\lim_{t \to \infty} D(t) = N$
else $\lim_{t \to \infty} D(t) = 0$
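The threshold can be read off by factoring the drift. The short derivation below is a sketch that assumes failures arrive at rate $(N-D)\,D\,k_l$ and repairs complete at rate $(N-D)\,k_s$, which is how the definitions of ks and kl on the fluid model slide are read here.

```latex
\begin{aligned}
\frac{dD}{dt} &= (N-D)\,D\,k_l - (N-D)\,k_s \\
              &= (N-D)\,(k_l D - k_s)
\end{aligned}
```

The factored form vanishes at $D = k_s/k_l$ and at $D = N$. For $k_s/k_l < D < N$ the drift is positive, so $D$ grows toward $N$; for $D < k_s/k_l$ it is negative, so $D$ shrinks toward 0. If $k_s/k_l \ge N$, no starting point below $N$ lies above the threshold.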
![Page 11: Network Resilience: Exploring Cascading Failures Vishal Misra Columbia University in the City of New York Joint work with Ed Coffman, Zihui Ge and Don](https://reader030.vdocuments.net/reader030/viewer/2022032704/56649d405503460f94a1aa7c/html5/thumbnails/11.jpg)
Phase transitions
N = 100, ks / kl = 20
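The behaviour on either side of the threshold can be reproduced numerically. The following is a minimal sketch, not the authors' code: it integrates the drift $dD/dt = (N-D)(k_l D - k_s)$ with a forward-Euler step and clamps $D$ to $[0, N]$. The function name, the step size, and the individual values ks = 0.2 and kl = 0.01 are assumptions chosen only so that N = 100 and ks / kl = 20.

```python
def simulate_down_nodes(n, k_s, k_l, d0, dt=1e-3, t_end=50.0):
    """Forward-Euler integration of dD/dt = (n - D) * (k_l * D - k_s).

    D is clamped to [0, n] because the number of down nodes cannot
    leave that range. All parameter values are illustrative assumptions.
    """
    d = float(d0)
    steps = int(t_end / dt)
    for _ in range(steps):
        drift = (n - d) * (k_l * d - k_s)
        d = min(max(d + drift * dt, 0.0), float(n))
    return d


if __name__ == "__main__":
    # Threshold is k_s / k_l = 20 down nodes for these assumed rates.
    for start in (15, 25):
        final = simulate_down_nodes(100, 0.2, 0.01, start)
        print(f"D(0) = {start:>2} -> D(t_end) = {final:.3f}")
```

Starting points below 20 decay to 0 and starting points above 20 grow toward N, which matches the absolute (not fractional) nature of the threshold.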
Properties of phase transition
Threshold is an absolute quantity rather than a fraction
Cliques with “powerful” (i.e., ks / kl high) nodes do not exhibit cascading failures
Smaller cliques more resistant to phase transitions
A Birth-Death model
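Again consider a clique of N nodes
The system state i is the number of down nodes
Transition rates are state dependent
[State diagram: birth-death chain over states 0, 1, ..., i, i+1, ..., N-1, N, with arcs labeled by the state-dependent rates $\lambda_i$ and $\mu_i$]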
Transient model
Since $\mu_N = 0$, state N is an absorbing state
System ends up in N with probability 1
Perform transient analysis, compute mean time to absorption, Wi, starting from state i
Wi is a good indicator of the stability of the system; a low value indicates a propensity to collapse to state N (where all nodes are down)
Physically, interpret Wi as the ability of the system to recover if it ends up in state i through some exogenous process (e.g. attacks)
Solution for Wi
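$$W_i = \frac{1}{\lambda_i + \mu_i} + \frac{\lambda_i}{\lambda_i + \mu_i}\, W_{i+1} + \frac{\mu_i}{\lambda_i + \mu_i}\, W_{i-1}$$
With boundary conditions
$$W_0 = \frac{1}{\lambda_0} + W_1$$
and
$$W_{N-1} = \frac{1}{\lambda_{N-1} + \mu_{N-1}} + \frac{\mu_{N-1}}{\lambda_{N-1} + \mu_{N-1}}\, W_{N-2}$$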
Solution (cont.)
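$$W_i = W_{i+1} + \frac{1}{\lambda_i} + \frac{1}{\lambda_i} \sum_{j=0}^{i-1} \prod_{k=j}^{i-1} \frac{\mu_{k+1}}{\lambda_k}$$
and
$$W_N = 0$$
yield a way to compute Wi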
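One way to see where a recursion of this kind comes from is sketched below. It assumes that $\lambda_i$ is the rate from state $i$ to $i+1$ and $\mu_i$ the rate from state $i$ to $i-1$; that indexing is an assumption made for this sketch. The helper quantity $T_i = W_i - W_{i+1}$, the mean time to first reach state $i+1$ from state $i$, is introduced here for illustration and does not appear on the slides.

```latex
\begin{aligned}
(\lambda_i + \mu_i)\, W_i &= 1 + \lambda_i W_{i+1} + \mu_i W_{i-1} \\
\lambda_i\,(W_i - W_{i+1}) &= 1 + \mu_i\,(W_{i-1} - W_i) \\
T_i &= \frac{1}{\lambda_i} + \frac{\mu_i}{\lambda_i}\, T_{i-1},
      \qquad T_0 = \frac{1}{\lambda_0} \\
W_i &= \sum_{m=i}^{N-1} T_m \qquad (\text{using } W_N = 0)
\end{aligned}
```

Unrolling the recursion for $T_i$ gives a sum of products of rate ratios, so each $W_i$ follows from $W_{i+1}$ by adding one such term.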
Modeling transition rates
$\lambda_i = (N-i)\, i\, k_l + k_a$
ka = ambient traffic load, kl similar to fluid model
$\mu_i = (N-i)\, k_s$
ks similar to fluid model
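With rates of this form, the mean time to absorption can be computed in a single pass. The sketch below is illustrative rather than the authors' implementation: it assumes $\lambda_i = (N-i)\,i\,k_l + k_a$ is the rate from state $i$ to $i+1$ and $\mu_i = (N-i)\,k_s$ the rate from state $i$ to $i-1$, and it accumulates the mean time to climb one state at a time. The function name and the value of ka in the usage lines are hypothetical, since the slides do not give a value for ka.

```python
def mean_absorption_times(n, k_s, k_l, k_a):
    """Return [W_0, ..., W_n]: mean time to reach state n (all nodes down).

    Assumed rates (state i = number of down nodes):
      lam[i] = (n - i) * i * k_l + k_a   # i -> i + 1
      mu[i]  = (n - i) * k_s             # i -> i - 1
    Requires n >= 1 and k_a > 0 so that state 0 can be left.
    """
    lam = [(n - i) * i * k_l + k_a for i in range(n + 1)]
    mu = [(n - i) * k_s for i in range(n + 1)]

    # t[i] = mean time to first reach state i + 1 starting from state i.
    t = [0.0] * n
    t[0] = 1.0 / lam[0]
    for i in range(1, n):
        t[i] = (1.0 + mu[i] * t[i - 1]) / lam[i]

    # W_n = 0 and W_i = W_{i+1} + t[i].
    w = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        w[i] = w[i + 1] + t[i]
    return w


if __name__ == "__main__":
    # k_a = 0.01 is a placeholder value, not taken from the slides.
    w = mean_absorption_times(20, 1.0, 0.01, 0.01)
    print(f"W_0 = {w[0]:.3e}, W_19 = {w[19]:.3e}")
```

Plain floating point is used throughout, so for very large values a small step added to a huge total can be lost to rounding; exact rational arithmetic would avoid that at some cost in speed.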
![Page 18: Network Resilience: Exploring Cascading Failures Vishal Misra Columbia University in the City of New York Joint work with Ed Coffman, Zihui Ge and Don](https://reader030.vdocuments.net/reader030/viewer/2022032704/56649d405503460f94a1aa7c/html5/thumbnails/18.jpg)
The mean time to absorption
N = 20, ks = 1, kl = 0.01
System stable, mean time to absorption of the order $10^{26}$, even if only one node is up
![Page 19: Network Resilience: Exploring Cascading Failures Vishal Misra Columbia University in the City of New York Joint work with Ed Coffman, Zihui Ge and Don](https://reader030.vdocuments.net/reader030/viewer/2022032704/56649d405503460f94a1aa7c/html5/thumbnails/19.jpg)
A larger clique
N = 100, ks = 1, kl = 0.01
System still stable, mean time to absorption of the order $10^{48}$, if only one node is up
![Page 20: Network Resilience: Exploring Cascading Failures Vishal Misra Columbia University in the City of New York Joint work with Ed Coffman, Zihui Ge and Don](https://reader030.vdocuments.net/reader030/viewer/2022032704/56649d405503460f94a1aa7c/html5/thumbnails/20.jpg)
The appearance of phase transitions
N = 200, ks = 1, kl = 0.01
Mean time to absorption drops from about $10^{47}$ to about 0 within a few states
![Page 21: Network Resilience: Exploring Cascading Failures Vishal Misra Columbia University in the City of New York Joint work with Ed Coffman, Zihui Ge and Don](https://reader030.vdocuments.net/reader030/viewer/2022032704/56649d405503460f94a1aa7c/html5/thumbnails/21.jpg)
Dependence on service rate/load
Transition point shifts right as ratio goes up
![Page 22: Network Resilience: Exploring Cascading Failures Vishal Misra Columbia University in the City of New York Joint work with Ed Coffman, Zihui Ge and Don](https://reader030.vdocuments.net/reader030/viewer/2022032704/56649d405503460f94a1aa7c/html5/thumbnails/22.jpg)
Dependence on clique size
Transition point remains roughly the same, relative stability goes down as N goes up
Early conclusions
Cascading failures possible in mutual support systems like a BGP clique
Presence of phase transitions depends on system parameters strongly
Clique size an important threshold, larger cliques more likely to undergo cascading failures
Future work
Refine model, plug in numbers for parameters
Look at different topologies
Do more detailed modeling of single router (fixed point solutions)