TRANSCRIPT
T-110.5116 Computer Networks II: Network diagnostics
07.11.2010 Matti Siekkinen
(Sources: R. Teixeira, “Internet measurements: fault detection, identification, and topology discovery”; S. Kandula, “Detailed Diagnosis in Enterprise Networks”)
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Network diagnostics
Understand how the network is doing
Detect and diagnose faults (links, routers, …)
Identify performance bottlenecks
o E.g. congested link
Detect and quarantine misbehaving devices or traffic
Anomaly detection
o E.g. misconfigured router, attacks
Why bother?
Keep things going
Stuff breaks down
Operators and admins are human beings and make mistakes
Want to keep the networks operational
Maximum benefit out of the infrastructure
Equipment costs money
Maximize utilization
Happy customers
Performance troubles make them unhappy
Unhappy customers may decrease revenues
Why is it challenging?
Few built-in diagnosis mechanisms
Today’s networks run on IP
Network elements are “simple”; intelligence lies at the edges
⇒ May need to use complex end-to-end methods to measure simple things (e.g. link capacity)
Scale can be very large
Traffic volumes
Number of nodes
Different services and protocols
⇒ Diagnosis techniques need to be scalable too
Diagnosing networks
Obtain some input data
SNMP traps, syslog msgs, trouble tickets, traffic traces, etc.
Inference / analysis
Analyze the input data, e.g. learn that a router link is down
Do something about it
E.g. start fixing the link
[Figure: three-step loop: collect raw measurements -> analyze measurements -> use learned information]
Ways to collect data for diagnosis
Management tools
Ask the devices how they are doing; receive alarms, traps
E.g. SNMP
Passive measurements
Simply record what you observe
E.g. Cisco’s NetFlow traffic data or raw traffic header traces
Active measurements
Send probes and observe what happens to them
E.g. tomography, bandwidth measurements
Where to collect measurements?
Network aggregation points
Router, switch
Access, gateway, backbone
Depends on scale, available methods, and objectives
Client or server
Usually limited possibilities
Possible in data center networks
[Figure: customers attach to access routers, which connect through gateway routers between ISP 1, ISP 2, and ISP 3, with backbone routers in the core]
Analyzing data
On-line
Perform (at least a part of) the analysis on the observed data in a real-time manner
☺ Data reduction -> don’t store everything
☺ Can react quickly
☹ Scalability
• An Internet2 backbone link (10 Gbit/s) produces >8 MB/s of uncompressed packet headers
• May need sampling, aggregation
☹ Do not necessarily have all the raw data for later analysis
Off-line
Record data into persistent storage and analyze later
☺ Run complex, time-consuming analysis
☹ Not for time-critical analysis
☹ Storage issues
Analyzing data (cont.)
Human vs. machine
Statistical analysis and data mining techniques
Reveal non-trivial patterns (aggregate/similar behavior, anomalies)
Still need an admin/operator somewhere in the loop
Combine many data sources
Increase robustness
o Fewer false positives
Detect issues that would normally “fly under the radar”
o Aggregated input feeds may reveal more
Analyzing data (cont.)
Why not just rely on alarms and traps (e.g. SNMP)?
Network “black holes”
o Silent failures: network devices do not send alarms
o Causes: complex cross-layer interactions, router software bugs/misconfigurations, …
Need for more detailed, application-specific diagnosis
Want insight into the causes of failures/problems that raise alarms
Diagnosis across administrative domains
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
A view from the edge
Measurements and diagnosis done from the end points
Not a strict classification of methodology…
Rely on
Probing messages analyzed at the other end
Clever use of standard protocols
o ping, traceroute
o Cannot use custom protocols since routers won’t react to and/or forward them
Three example cases
Topology discovery
Bandwidth measurements
Network tomography
o Topology information needed
Topology discovery
Topology describes how the network is laid out
Links between routers, switches, etc.
Not trivial knowledge in large-scale networks
Why do this?
Some diagnosis methods rely on accurate topology information
o Network tomography
Realistic simulation and modeling of the Internet
o Topology models needed for simulations
o Correctness of network protocols typically independent of topology
o Performance of networks critically dependent on topology
• e.g., convergence of route information
Measuring topology
Different vantage points
With access to routers (or “from inside”)
o Topology of one network
o Routing monitors (OSPF or IS-IS)
No access to routers (or “from outside”)
o Multi-AS topology or from end-hosts
o Monitors issue active probes: traceroute
Different granularity levels
Router-level topologies
o Reflect physical connectivity between nodes
o Inferred using e.g. traceroute
AS graphs
o Peering relationships between providers/clients
o Inferred from inter-domain routers’ BGP tables
o Could also use traceroute with some additional information
Topology from inside
Routing protocols flood the state of each link
Periodically refresh link state
Report any changes: link down, up, cost change
Monitor listens to link-state messages
Acts as a regular router
o AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
Combining link states gives the topology (see the sketch below)
Easy to maintain: messages report any changes
Usually not possible across domains
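To make the last step concrete, here is a minimal sketch that maintains a topology graph from link-state reports. It assumes a hypothetical feed of (router, neighbor, cost) tuples, such as one parsed from an IS-IS/OSPF monitor like PyRT; it is not the actual PyRT or OSPFmon API.

```python
# Sketch: maintain a topology graph from link-state advertisements.
# Assumes a hypothetical (router, neighbor, cost) feed, not a real monitor API.
import networkx as nx

def build_topology(lsas):
    """lsas: iterable of (router_id, neighbor_id, cost) link states."""
    g = nx.DiGraph()
    for router, neighbor, cost in lsas:
        g.add_edge(router, neighbor, cost=cost)
    return g

def apply_update(g, router, neighbor, cost=None, up=True):
    """Link-state messages report any change: link up, down, or cost change."""
    if up:
        g.add_edge(router, neighbor, cost=cost)
    elif g.has_edge(router, neighbor):
        g.remove_edge(router, neighbor)

g = build_topology([("r1", "r2", 10), ("r2", "r3", 5), ("r1", "r3", 20)])
apply_update(g, "r1", "r3", up=False)  # link r1 -> r3 went down
print(list(g.edges(data=True)))
```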
Inferring a path from outside: traceroute
[Figure: monitor m sends probes toward target t with TTL=1 and TTL=2; routers A and B answer “TTL exceeded” from interfaces A.1 and B.1, so the inferred path m -> A.1 -> B.1 -> t reflects the incoming interfaces of the actual path]
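The figure’s mechanics can be sketched in a few lines with Scapy (a packet-crafting library; sending raw packets requires root privileges, and the target below is a placeholder documentation address):

```python
# Sketch of traceroute's TTL-limited probing: each probe with TTL=k elicits
# an ICMP time-exceeded message from the k-th hop's incoming interface.
from scapy.all import IP, UDP, ICMP, sr1  # pip install scapy

def trace(dst, max_ttl=30):
    hops = []
    for ttl in range(1, max_ttl + 1):
        probe = IP(dst=dst, ttl=ttl) / UDP(dport=33434 + ttl)
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is None:
            hops.append("*")        # star: rate limiting or anonymous router
        elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
            hops.append(reply.src)  # TTL exceeded: an intermediate hop
        else:
            hops.append(reply.src)  # e.g. port unreachable: destination reached
            break
    return hops

print(trace("192.0.2.1"))  # placeholder target
```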
A traceroute path can be incomplete
Load balancing is widely used
Traceroute only probes one path
Sometimes traceroute has no answer (stars)
ICMP rate limiting for DoS protection
Anonymous routers
o Do not send ICMP replies at all, or reply with the probe’s destination IP
o Security and privacy concerns
Tunnelling (e.g., MPLS) may hide routers
Routers inside the tunnel may not decrement TTL
Traceroute under load balancing
[Figure: load balancer L forwards probes via either router B or router C on the way to D and E; because traceroute probes only one path, the inferred path has missing nodes and links and can even contain a false link between routers that are not directly connected]
Traceroute under load balancing (cont.)
Even per-flow load balancing causes trouble
Traceroute uses the destination port as identifier
Needs to match probe to response
Response only has the header of the issued probe
[Figure: classic traceroute varies the destination port per probe (port 2 for TTL=2, port 3 for TTL=3), so a per-flow load balancer hashes successive probes onto different paths]
Paris traceroute
Solves the problem with per-flow load balancing
Probes to a destination belong to the same flow
Keep flow IDs constant for probes to a specific destination
Flow ID = src/dst IP & port, transport protocol
How to match probes with ICMP responses? (see the sketch below)
Vary fields within the first eight octets of the transport-layer header (included in the ICMP response)
Keep the flow-ID-related fields constant
UDP probes: vary the checksum (need to manipulate the payload too)
ICMP probes: vary the sequence number, but also the Identifier -> keep the checksum constant
[Figure: Paris traceroute keeps port 1 for all probes and varies the checksum instead (checksum 2 for TTL=2, checksum 3 for TTL=3), so all probes follow the same path through load balancer L]
More details in: B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC 2006.
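The following is a hedged sketch of the Paris idea with Scapy, not the actual Paris traceroute implementation: the 5-tuple stays fixed (the port numbers below are arbitrary), only the payload varies so each probe gets a distinct UDP checksum, and the checksum quoted back inside the ICMP time-exceeded message identifies the probe.

```python
# Paris-style probing sketch: constant flow ID, match replies by checksum.
from scapy.all import IP, UDP, Raw, UDPerror, sr1

SPORT, DPORT = 33456, 33457  # fixed ports: part of the flow ID

def paris_probe(dst, ttl):
    pkt = IP(dst=dst, ttl=ttl) / UDP(sport=SPORT, dport=DPORT) / Raw(bytes([ttl]))
    return IP(bytes(pkt))    # rebuild so Scapy fills in the UDP checksum

def hop(dst, ttl):
    probe = paris_probe(dst, ttl)
    reply = sr1(probe, timeout=2, verbose=0)
    if reply is not None and reply.haslayer(UDPerror):
        # the quoted first 8 octets of the UDP header carry the checksum,
        # which tells us which probe triggered this reply
        if reply[UDPerror].chksum == probe[UDP].chksum:
            return reply.src
    return None
```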
More traceroute shortcomings
Inferred nodes = interfaces, not routers
Different interfaces have different IP addresses
Coverage depends on monitors and targets
Misses links and routers
Some links and routers appear multiple times
[Figure: an actual topology with routers A, B, C, D between monitors m1, m2 and targets t1, t2 is inferred as separate interface nodes A.1, B.3, C.1, C.2, D.1, so one router can appear several times and some links are missed]
Alias resolution: Map interfaces to routers
Direct probing (see the sketch below)
Responses from the same router will have close IP identifiers and the same TTL
N. Spring et al., “Measuring ISP Topologies with Rocketfuel”, SIGCOMM 2002.
Record-route IP option
Records up to nine IP addresses of routers in the path
Can also discover outgoing interfaces
R. Sherwood et al., “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM 2008.
[Figure: in the inferred topology, interfaces C.1 and C.2 turn out to be the same router]
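A sketch of the direct-probing test with Scapy; the max_gap threshold is an illustrative assumption, not a value from the Rocketfuel paper.

```python
# Alias resolution by direct probing: replies drawn from one router's shared
# IP-ID counter arrive with close identifiers and the same TTL.
from scapy.all import IP, UDP, sr1

def probe_ipid(addr):
    reply = sr1(IP(dst=addr) / UDP(dport=33434), timeout=2, verbose=0)
    return (reply.id, reply.ttl) if reply is not None else None

def likely_aliases(addr_a, addr_b, max_gap=10):
    a, b = probe_ipid(addr_a), probe_ipid(addr_b)
    if a is None or b is None:
        return False
    (id_a, ttl_a), (id_b, ttl_b) = a, b
    # second reply's ID should be slightly ahead of the first (mod 2^16)
    return ttl_a == ttl_b and (id_b - id_a) % 65536 < max_gap
```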
Large-scale topology measurements
Probing a large topology takes time
E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
Probing more targets covers more links
But getting a topology snapshot takes longer
Snapshot may be inaccurate
Paths may change during the snapshot
To know that a path changed, need to re-probe
Make it more feasible by reducing redundant probing (see the sketch below)
Aggregation points in network
B. Donnet et al., “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS 2005.
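The sketch below illustrates the stop-set idea behind Doubletree-style cooperative probing, a strong simplification of the algorithms in the cited paper: a trace toward a destination stops as soon as it reaches an (interface, destination) pair that some monitor has already probed.

```python
# Redundancy reduction with a shared stop set (simplified Doubletree idea).
def traced_hops(path, dest, stop_set):
    """path: the hop list a full trace toward dest would visit, in TTL order."""
    probed = []
    for hop in path:
        probed.append(hop)
        if (hop, dest) in stop_set:
            break                      # the rest of the path is already known
        stop_set.add((hop, dest))
    return probed

stop_set = set()
print(traced_hops(["A", "B", "C", "D"], "t1", stop_set))  # probes all 4 hops
print(traced_hops(["E", "B", "C", "D"], "t1", stop_set))  # stops at B
```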
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Bandwidth measurements
What?
Infer the bandwidth of a specific hop or of a whole path
Capacity = maximum possible throughput
Available bandwidth = portion of capacity not currently used
Bulk transfer capacity = throughput that a new single long-lived TCP connection could obtain
Why?
Network-aware applications
o Server or peer selection
o Route selection in overlay networks
QoS verification
Challenges
Routers and switches do not provide direct feedback to end-hosts (except ICMP, which is also of limited use)
Mostly due to scalability, policy, and simplicity reasons
Network administrators can read router/switch information using the SNMP protocol
End-to-end bandwidth estimation cannot be done that way
No access because of administrative barriers
The Internet is a “black box”
End-systems can infer network state through end-to-end (e2e) measurements
Without any feedback from routers
Objectives: accuracy, speed, minimal intrusiveness
[Figure: probing packets crossing the Internet, drawn as a black-box cloud between end-systems]
Metrics and definitions
[Figure: a simple example of an end-to-end path: source host -> link1 (access link) -> router1 -> link2 -> router2 -> link3 (access link) -> destination host, with cross traffic entering and leaving at both routers]
Metrics and definitions (cont.)
Capacity of this path is 100 Mbps
Determined by the narrow link
Available bandwidth of this path is 50 Mbps
Determined by the tight link

        link capacity   available bandwidth   used bandwidth
link1   100 Mbps        90 Mbps               10 Mbps          (narrow link)
link2   2500 Mbps       1300 Mbps             1200 Mbps
link3   1000 Mbps       50 Mbps               950 Mbps         (tight link)
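The path-level metrics follow from the per-link values by taking minima, as a tiny sketch over the numbers in the table shows:

```python
# Path capacity comes from the narrow link (minimum capacity); path available
# bandwidth comes from the tight link (minimum available bandwidth).
links = {  # Mbps: (capacity, available bandwidth)
    "link1": (100, 90),
    "link2": (2500, 1300),
    "link3": (1000, 50),
}
path_capacity = min(c for c, _ in links.values())   # 100 Mbps (narrow link)
path_available = min(a for _, a in links.values())  # 50 Mbps (tight link)
print(path_capacity, path_available)
```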
Measurement techniques
Generally use active probing
Send packets with a specific inter-arrival pattern
Observe the pattern at the other end
Example: packet-pair technique for capacity estimation
Send two equal-sized packets back-to-back
o Packet size: L
o Packet transmission time at link i: L/C_i
P-P dispersion: time interval between the first bits of the two packets
Without any cross traffic, the dispersion at the receiver is determined by the narrow link:
Δ_R = max_i (L / C_i) = L / min_i C_i = L / C, where C is the path capacity (see the sketch below)
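A minimal sketch of inverting the dispersion formula: the receiver timestamps the two back-to-back packets and divides the packet size by the measured dispersion.

```python
# Packet-pair capacity estimate: C = L / dispersion (no cross traffic assumed).
def packet_pair_capacity(arrival_first, arrival_second, packet_size_bytes):
    dispersion = arrival_second - arrival_first      # seconds
    return packet_size_bytes * 8 / dispersion        # bits per second

# e.g. 1500-byte packets arriving 120 microseconds apart -> 100 Mbit/s
print(packet_pair_capacity(0.0, 120e-6, 1500))
```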
Bandwidth estimation: CapProbe
R. Kapoor et al., “CapProbe: A Simple and Accurate Capacity Estimation Technique”, SIGCOMM 2004.
CapProbe is a capacity estimation tool
Takes into account the effect of cross-traffic
Cross-traffic packets can affect the P-P dispersion
P-P expansion: capacity underestimation
P-P compression: capacity overestimation
Noise in the P-P distribution depends on the cross-traffic load
CapProbe: Ideal Packet Dispersion
No cross-traffic
Capacity = (packet size) / (dispersion)
CapProbe: Expansion of Dispersion
Cross traffic (CT) serviced between the PP packets
Second packet queues due to cross traffic => expansion of dispersion => under-estimation
CapProbe: Compression of Dispersion
First packet queues => compressed dispersion => over-estimation
CapProbe: The approach
Observations:
First packet queues more than the second
o Compression
o Over-estimation
Second packet queues more than the first
o Expansion
o Under-estimation
Both expansion and compression are the result of probe packets experiencing queuing
o Sum of PP delays includes queuing delay
Filter out PP samples that do not have minimum queuing time
The dispersion of the PP sample with the minimum delay sum reflects the capacity
CapProbe Observation
For each packet pair, CapProbe calculates the delay sum: delay(packet_1) + delay(packet_2)
A PP with the minimum delay sum points out the capacity (see the sketch below)
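A sketch of CapProbe’s filtering rule over packet-pair samples (the sample values are illustrative, not measurements):

```python
# Keep the packet pair with the minimum delay sum, i.e. the pair that
# (most likely) experienced no queuing, and read capacity off its dispersion.
def capprobe_estimate(samples, packet_size_bytes):
    """samples: list of (delay1, delay2, dispersion) tuples, in seconds."""
    best = min(samples, key=lambda s: s[0] + s[1])   # minimum delay sum
    return packet_size_bytes * 8 / best[2]           # bits per second

samples = [
    (0.010, 0.012, 200e-6),  # expanded: second packet queued -> underestimate
    (0.013, 0.011, 80e-6),   # compressed: first packet queued -> overestimate
    (0.009, 0.009, 120e-6),  # no queuing: minimum delay sum
]
print(capprobe_estimate(samples, 1500))  # -> 100 Mbit/s
```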
Bandwidth estimation tools
Many estimation tools & techniques
Abing, netest, pipechar, STAB, pathneck, IGI/PTR, abget, Spruce, pathchar, clink, pchar, PPrate, DSLprobe, ABwProbe, …
Some practical issues
Traffic shapers
Non-FIFO queues
More scalable methods
Passive measurements instead of active measurements
o E.g. PPrate (2006) for capacity estimation: adapts Pathrate’s algorithm
One measurement host instead of two cooperating ones
o abget (2006) for available bandwidth estimation
o DSLprobe for capacity estimation of asymmetric (ADSL) links
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Different viewpoints of the network
Network operators only have data of one AS
AS4 doesn’t detect any problem
AS3 doesn’t know who is affected by the failure
End-hosts don’t know what happens in the network
Can only monitor end-to-end paths
[Figure: an end-to-end path crossing AS1, AS2, AS3, and AS4, with the failure inside AS3]
Network Tomography
View the network as a “black box”
Probe the network from the edge
Analogy to medical imaging with x-rays
Fault diagnosis with tomography
Diagnose persistent reachability problems across domains
Useful to detect black holes
o Silent failures that do not produce alerts!
Two-phase approach
Detect: end-to-end path monitoring
Localize: binary tomography
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
o Fault detection
o Fault localization
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Fault detection: end-to-end monitoring
What is monitored?
Different properties of network links
Loss rate, delay, bandwidth, connectivity
Who monitors?
Network operators
o In-network monitoring hosts
o Third-party monitoring services
o From home gateways
End users
o Cooperative monitoring
o Users of popular services/applications
Fault detection: end-to-end monitoring (cont.)
How is it monitored?
End-to-end: from probe senders to probe collectors
No access to routers
Using multicast probes would be efficient
o IP multicast not deployed in practice
[Figure: one probe sender fanning out probes to several probe collectors]
Monitoring techniques
Active probing: ping
Send probe, collect response
From any end host
o Works for network operators and end users
Passive analysis of user’s traffic
Tap incoming and outgoing traffic
o At user’s machines or servers: tcpdump, pcap
o Inside the network: DAG card
Monitor status of TCP connections
Fault detection with ping
Monitor m sends an ICMP echo request probe to target t, which answers with an ICMP echo reply (see the sketch below)
If m receives the reply, then the path is good
If no reply arrives before the timeout, then the path is bad
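A sketch of this test with Scapy (needs root; the target below is a placeholder documentation address):

```python
# Active fault detection: reply before the timeout -> good, silence -> bad.
from scapy.all import IP, ICMP, sr1

def path_is_good(target, timeout=2.0):
    reply = sr1(IP(dst=target) / ICMP(), timeout=timeout, verbose=0)
    return reply is not None and reply.haslayer(ICMP) and reply[ICMP].type == 0

print(path_is_good("192.0.2.1"))  # placeholder target
```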
Persistent failure or measurement noise?
Many reasons to lose a probe or a reply
Timeout may be too short
Rate limiting at routers
Some end-hosts don’t respond to ICMP requests
Transient congestion
Routing change
Need to confirm that a failure is persistent
Otherwise, may trigger false alarms
Failure confirmation
Upon detection of a failure, trigger extra probes
Two main parameters (see the sketch below)
Number of probes
o Too large a number perturbs the path and may cause additional losses
Time between probes
o Periodic? Which interval?
Parameter values determine the robustness and reactiveness
Make sure that we indeed observe a persistent failure
Do not spend too long confirming the failure
[Figure: a loss burst of packets on a path can look like a failure and produce a false positive]
More details in: I. Cunha et al., “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, IMC 2009.
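A sketch of the confirmation step with the two parameters above; probe_fn stands for any single-probe test, e.g. the ping check sketched earlier.

```python
# Confirm a suspected failure with n extra probes spaced `interval` apart;
# declare it persistent only if every confirmation probe is lost. Larger n and
# interval give robustness against loss bursts at the cost of reaction time.
import time

def confirm_failure(probe_fn, n_probes=5, interval=1.0):
    """probe_fn() returns True if the path answered."""
    for _ in range(n_probes):
        if probe_fn():
            return False   # transient loss, not a persistent failure
        time.sleep(interval)
    return True            # all confirmation probes lost
```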
Passive detection
At end hosts
tcpdump/pcap captures packets
Track the status of each TCP connection (see the sketch below)
o RTTs, timeouts, retransmissions
Multiple timeouts indicate the path is bad
Inside the network it is more challenging
Traffic volume is high
o Need special hardware
• DAG cards can capture packets at high speeds
o May lose packets
Tracking TCP connections is hard
o May not capture both sides of a connection
o Large processing and memory overhead
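A sketch of the end-host case using Scapy’s pcap reader: counting repeated data sequence numbers per flow gives a rough retransmission signal (real trackers also follow RTTs and timeouts; the trace filename is a placeholder).

```python
# Passive detection: flows with many retransmissions hint at a bad path.
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP

def retransmissions(pcap_file):
    seen = defaultdict(set)   # flow -> data sequence numbers already seen
    retx = defaultdict(int)   # flow -> retransmission count
    for pkt in rdpcap(pcap_file):
        if IP in pkt and TCP in pkt and len(pkt[TCP].payload) > 0:
            flow = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
            if pkt[TCP].seq in seen[flow]:
                retx[flow] += 1          # same data segment observed again
            seen[flow].add(pkt[TCP].seq)
    return retx

for flow, count in retransmissions("trace.pcap").items():  # placeholder file
    print(flow, count)
```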
Passive vs. active detection
Passive
+ No need to inject traffic
+ Detects all failures that affect user’s traffic
+ Responses from targets that don’t respond to ping
‒ Not always possible to tap user’s traffic
‒ Only detects failures in paths with traffic
Active
+ No need to tap user’s traffic
+ Detects failures in any desired path
+ Fast detection
‒ Probing overhead
‒ Deployment overhead
Aggregation: motivation
Lack of synchronization leads to inconsistencies
Probes cross links at different times
Path may change between probes
Measurements from a single monitor
Probing all targets can take time
Measurements from multiple monitors
Hard to synchronize monitors for all probes to reach a link at the same time
Impossible to generalize to all links
[Figure: monitor m probing targets t1 and t2 at different times can mistakenly infer a failure]
Aggregation: motivation (cont.)
[Figure: a path reachability matrix for monitors m1…mK and targets t1…tN; some entries are good, others bad, e.g. (m1, t1) = good but (mK, t1) = bad, illustrating inconsistent measurements]
Aggregation: Basic idea
Reprobe paths after a failure
Consistency has a cost
Delays fault localization
Cannot identify short failures
[Figure: the same reachability matrix after re-probing, now with consistent good/bad entries per path]
Aggregation
Aggregation = building a reachability matrix out of path measurements
Three strategies, each triggered by a path status change
BASIC
o Build the reachability matrix after the next full monitoring cycle
o Consistency problems remain
• e.g. frequent detection errors
MC (multi-cycle)
o Wait n cycles until measurements are identical
o High consistency at the expense of increased delay
MC-PATH (multi-cycle noise tolerant)
o Strikes a balance between BASIC and MC
o Fix n and do not require identical status for all paths
o Unstable paths marked as “up”
Aggregation: tradeoff
Consistency vs. localization speed
Faster localization leads to false alarms
Slower localization misses short failures
Network operators
Too many false alarms are unmanageable
Longer failures are the ones that need intervention
End users
Even short failures affect performance
More details in: I. Cunha et al., “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, IMC 2009.
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
o Fault detection
o Fault localization
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Fault localization: Binary tomography
Labels paths as good or bad
Loss-rate estimation requires tight correlation
Instead, separate good from bad performance
If a link is bad, all paths that cross the link are bad
Find the smallest set of links that explains the bad paths
Given that bad links are uncommon
A bad link is the root of a maximal bad subtree
[Figure: probing tree from monitor m to targets t1 and t2; the bad link is the root of the subtree whose paths are all bad]
Binary tomography in practice
Multiple sources and targets
Problem becomes NP-hard
Minimum hitting set problem
o Hitting set of a link = paths that traverse the link
Iterative greedy algorithm
Given the set of links in bad paths
Iteratively choose the link that explains the max number of bad paths
Further issues
Topology may be unknown
o Need to measure accurate topology
Multicast not available
o Need to extract correlation from unicast probes
[Figure: two monitors m1 and m2 probing targets t1 and t2 over a shared tree]
Greedy approximation
Method based on spatial correlation
Intersect the sets of OD-pairs that experienced failures to discover shared links
Shared links are the most likely explanation -> hypothesis
Apply a greedy approximation approach
Can have single, dual, or multiple failures
Infeasible to explore all possibilities
Just look for the most likely explanation
[Figure: intersecting the failed OD-pairs leaves {G-H} as the hypothesis for the black hole]
Greedy approximation (cont.)
Failure detection generates a failure signature
Set of lost probes for all OD-pairs
A.k.a. reachability matrix
MAX-COVERAGE algorithm (see the sketch below)
Pick the link that explains the largest number of observations
Add the link to the hypothesis
Remove the corresponding observations from the failure signature
Repeat iteratively until no observations remain in the failure signature
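A sketch of the MAX-COVERAGE greedy loop; the mapping from links to the OD-pair observations they can explain comes from the measured topology.

```python
# Greedy set cover over the failure signature: repeatedly pick the link that
# explains the most remaining failed OD-pairs.
def max_coverage(failure_signature, link_to_odpairs):
    remaining = set(failure_signature)
    hypothesis = []
    while remaining:
        link = max(link_to_odpairs,
                   key=lambda l: len(link_to_odpairs[l] & remaining))
        covered = link_to_odpairs[link] & remaining
        if not covered:
            break              # leftover observations cannot be explained
        hypothesis.append(link)
        remaining -= covered
    return hypothesis

# toy example: "G-H" explains both failed OD-pairs, so it is picked first
links = {"A-B": {("m1", "t1")}, "G-H": {("m1", "t1"), ("m2", "t2")}}
print(max_coverage({("m1", "t1"), ("m2", "t2")}, links))  # -> ['G-H']
```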
Fault localization: greedy algorithm
How to select the candidate link within the hypothesis?
ABSOLUTE: pick links that explain at least n observations
RELATIVE: pick links that explain at least a fraction f of the observations
Pros & cons
Avoids having to explore all possibilities
Can localize multiple failures
Bias in favor of links present in many paths
More details in: R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
View from the inside
Need for more detailed diagnostics
Tomography is too coarse-grained
Basic SNMP-style monitoring does not reveal root causes
Trade off some scale for detail
Focus on (small-scale) enterprise networks
Example case: a diagnosis system called NetMedic
Small enterprise network troubleshooting
Requirements
Handle app-specific as well as generic faults
Identify culprits at a fine granularity
Small enterprises have…
less sophisticated admins
less rich connectivity
many shared components
Scale is not huge -> can afford detailed diagnosis
Tradeoff between detail and scale
Requirements for dependency-graph-based formulations
Model the network as a dependency graph at a fine-grained level
o Not just machine-level
o Not just faulty or healthy
Complex dependency model
o Allow circular and direct/indirect mutual dependencies
Trade off some scalability
Example problem 1: Server misconfig
A software update had changed the Web server config
Incorrect processing of some scripts
Operator aware of the update, not of the config change
[Figure: browsers talking to a Web server; the server config is the culprit]
Example problem 2: Buggy client
C1 experienced slow performance from the SQL server due to misbehaving client C2
[Figure: SQL clients C1 and C2 sending requests to the same SQL server]
Example problem 3: Client misconfig
Client config was overridden with an incorrect mail server type
Bug in client software triggered by an overnight update
[Figure: Outlook clients with their configs talking to an Exchange server]
A formulation for detailed diagnosis
Challenge: application-specific diagnosis in an application-agnostic way
Point solutions are easy but infeasible
Dependency graph of fine-grained components
Process, OS, configs, network path, …
A process depends on its OS, a configuration on the application it is running
Snoop socket-level events to connect communicating processes
Component state is a multi-dimensional vector
Application-dependent and application-independent variables
Diagnostic system unaware of semantics
Variables exposed by OS and apps
o E.g. Windows performance counters, /proc/
[Figure: dependency graph of process, OS, and config components for SQL clients C1 and C2, an SQL server, an Exchange server, and an IIS server with its IIS config; example state variables: % CPU time, IO bytes/sec, connections/sec, 404 errors/sec]
The goal of diagnosis
Identify likely culprits for components of interest
Without using semantics of state variables
No application knowledge
[Figure: dependency graph of a server Svr and clients C1 and C2, each modeled as process, OS, and config components]
Using joint historical behavior to estimate impact
Identify time periods when the state of a source component S was “similar” to its current state
Estimate how “similar”, on average, the states of a destination component D were at those times
If S is in a previously unseen state, conservatively estimate the impact to be high (see the sketch below)
[Figure: state-vector histories of D and S over time; e.g. the server’s history contains periods with (request rate high, response time high) and (request rate low, response time high) that are matched against the current state]
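A numpy sketch of this estimation, a simplification rather than NetMedic’s actual algorithm (the similarity radius and the per-variable normalization are invented for illustration):

```python
# Edge weight from joint history: average abnormality of destination D over
# the past time bins in which source S looked like it does now; an unseen
# state of S conservatively maps to high impact.
import numpy as np

def edge_weight(S_hist, D_abnormality, s_now, radius=0.2):
    """S_hist: (T, k) past state vectors of S, each variable scaled to [0, 1];
    D_abnormality: (T,) abnormality score of D per bin; s_now: (k,) state."""
    dist = np.linalg.norm(S_hist - s_now, axis=1) / np.sqrt(S_hist.shape[1])
    similar = dist < radius
    if not similar.any():
        return 1.0  # previously unseen state: assume high impact
    return float(D_abnormality[similar].mean())

rng = np.random.default_rng(0)
S_hist = rng.random((100, 3))   # toy history of S
D_abn = rng.random(100)         # toy abnormality of D
print(edge_weight(S_hist, D_abn, S_hist[10]))
```

Culprits are then ranked along the dependency graph using these edge weights, e.g. via the geometric mean of the weights on a path (see the next slide).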
Further details
Ranking of likely culprits
Mainly using the dependency graph’s path weight
o Geometric mean of edge weights
Evaluated with a quite small setup
Works, but scalability to large environments may be an issue
NetMedic needs ~60 minutes of history
Two sub-problems at the intersection with HCI
Visualizing complex analysis (NetClinic)
Intuitiveness of analysis
More details in: S. Kandula et al., “Detailed diagnosis in enterprise networks”, SIGCOMM 2009. (Microsoft Research)
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Anomaly detection
Study abnormal traffic
Non-productive traffic, a.k.a. Internet “background radiation”
Traffic that is malicious (scans for vulnerabilities, worms) or mostly harmless (misconfigured devices)
Network troubleshooting
Identify and locate misconfigured or compromised devices
Intrusion detection
Identify malicious activity before it hits you
Analyze traffic for attack signatures
Characterizing malicious activities
Honeypot: a host that sits and waits for attacks
o Learn how attackers probe for and exploit a system
Network telescope: a portion of routed IP address space on which little or no legitimate traffic exists
Anomaly detection techniques
Typical procedure
Learn underlying models of normal traffic
o May require a long time (days or weeks)
Detect deviations from the normal
Several techniques exist, based on statistical analysis
PCA: Principal Component Analysis (see the sketch below)
Kalman filter
Wavelets
…
Some techniques require little or no training
ASTUTE
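As one concrete instance of the statistical techniques listed above, here is a toy PCA-based detector in the spirit of the subspace method (a sketch, not the full method from the literature): learn the dominant “normal” subspace from a training window of link measurements and flag time bins with a large residual.

```python
# PCA anomaly detection sketch: the residual is the part of a measurement
# vector that falls outside the learned "normal traffic" subspace.
import numpy as np

def fit_normal_subspace(X_train, k=3):
    """X_train: (T, n_links) traffic matrix; returns the mean and top-k axes."""
    Xc = X_train - X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return X_train.mean(axis=0), Vt[:k]

def residual_norm(x, mean, V):
    xc = x - mean
    return np.linalg.norm(xc - V.T @ (V @ xc))  # energy outside the subspace

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                      # toy training traffic
mean, V = fit_normal_subspace(X)
threshold = np.percentile([residual_norm(x, mean, V) for x in X], 99)
spike = X[0] + 10 * rng.normal(size=20)             # injected anomaly
print(residual_norm(spike, mean, V) > threshold)    # True: flagged
```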
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Wrapping up
Diagnosing networks may be quite challenging
Complexity
Scale
Network elements are dumb
Hence, solutions can be quite sophisticated
Simply analyzing alarms and triggers may be time-consuming and lets faults “fly under the radar”
o False positives, black holes, complex dependencies, …
Statistical analysis, data mining
Combine many sources of data, even of “low” quality
o Whatever is available
Two viewpoints:
From the edge: tomography
o Controlled probing of the network
From the inside
o Collect (passively) and analyze as much data as you can
o Anomaly detection
Want to know more?
Bandwidth measurements
1. V. Jacobson, “pathchar: A Tool to Infer Characteristics of Internet Paths”, ftp://ftp.ee.lbl.gov/pathchar/, Apr. 1997.
2. A. B. Downey, “Using Pathchar to Estimate Internet Link Characteristics”, in Proceedings of ACM SIGCOMM, Sept. 1999, pp. 222–223.
3. K. Lai and M. Baker, “Measuring Link Bandwidths Using a Deterministic Model of Packet Delay”, in Proceedings of ACM SIGCOMM, Sept. 2000, pp. 283–294.
4. A. Pasztor and D. Veitch, “Active Probing using Packet Quartets”, in Proceedings of the Internet Measurement Workshop (IMW), 2002.
5. R. S. Prasad, M. Murray, C. Dovrolis, and K. Claffy, “Bandwidth estimation: metrics, measurement techniques, and tools”, IEEE Network, November/December 2003.
6. S. Keshav, “A Control-Theoretic Approach to Flow Control”, in Proceedings of ACM SIGCOMM, Sept. 1991, pp. 3–15.
7. R. L. Carter and M. E. Crovella, “Measuring Bottleneck Link Speed in Packet-Switched Networks”, Performance Evaluation, vol. 27–28, pp. 297–318, 1996.
8. V. Paxson, “End-to-End Internet Packet Dynamics”, IEEE/ACM Transactions on Networking, vol. 7, no. 3, pp. 277–292, June 1999.
9. M. Jain and C. Dovrolis, “End-to-End Available Bandwidth: Measurement Methodology, Dynamics, and Relation with TCP Throughput”, in Proceedings of ACM SIGCOMM, Aug. 2002, pp. 295–308.
10. B. Melander, M. Bjorkman, and P. Gunningberg, “Regression-Based Available Bandwidth Measurements”, in International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2002.
11. V. Ribeiro, R. Riedi, R. Baraniuk, J. Navratil, and L. Cottrell, “pathChirp: Efficient Available Bandwidth Estimation for Network Paths”, in Proceedings of the Passive and Active Measurement (PAM) workshop, Apr. 2003.
12. N. Hu and P. Steenkiste, “Evaluation and Characterization of Available Bandwidth Probing Techniques”, IEEE Journal on Selected Areas in Communications, 2003.
13. K. Harfoush, A. Bestavros, and J. Byers, “Measuring Bottleneck Bandwidth of Targeted Path Segments”, in Proceedings of IEEE INFOCOM, 2003.
14. M. Allman, “Measuring End-to-End Bulk Transfer Capacity”, in Proceedings of the ACM SIGCOMM Internet Measurement Workshop, Nov. 2001, pp. 139–143.
15. L. Lao, M. Y. Sanadidi, and C. Dovrolis, “The Probe Gap Model can Underestimate the Available Bandwidth of Multihop Paths”, ACM SIGCOMM Computer Communication Review, October 2006.
16. D. Antoniades, M. Athanatos, A. Papadogiannakis, E. P. Markatos, and C. Dovrolis, “Available bandwidth measurement as simple as running wget”, Passive and Active Measurement (PAM) conference, March 2006.
17. R. Kapoor et al., “CapProbe: A Simple and Accurate Capacity Estimation Technique”, SIGCOMM 2004.
18. D. Croce, T. En-Najjary, G. Urvoy-Keller, and E. Biersack, “Capacity Estimation of ADSL links”, in Proceedings of the 4th ACM CoNEXT conference, December 2008.
Want to know more?
Topology discovery
1. P. Marchetta, P. Mérindol, B. Donnet, A. Pescapé, and J.-J. Pansiot, “Topology Discovery at the Router Level: A New Hybrid Tool Targeting ISP Networks”, IEEE Journal on Selected Areas in Communications, Special Issue on Measurement of Internet Topologies, 2011 (to appear).
2. J.-J. Pansiot, P. Mérindol, B. Donnet, and O. Bonaventure, “Extracting Intra-Domain Topology from mrinfo Probing”, in Proceedings of the Passive and Active Measurement Conference (PAM), pp. 81–90, April 2010.
3. B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Deployment of an Algorithm for Large-Scale Topology Discovery”, IEEE Journal on Selected Areas in Communications, Sampling the Internet: Techniques and Applications, 24(12):2210–2220, Dec. 2006.
4. Y. Bejerano, “Taking the Skeletons Out of the Closets: A Simple and Efficient Topology Discovery Scheme for Large Ethernet”, Proc. IEEE INFOCOM, Apr. 2006.
5. Z. M. Mao et al., “Scalable and Accurate Identification of AS-level Forwarding Paths”, Proc. IEEE INFOCOM, Mar. 2004.
6. P. Mahadevan et al., “The Internet AS-Level Topology: Three Data Sources and One Definitive Metric”, ACM SIGCOMM Computer Communication Review, vol. 36, no. 1, Jan. 2006, pp. 17–26.
7. B. Donnet and T. Friedman, “Internet Topology Discovery: a Survey”, IEEE Communications Surveys and Tutorials, 9(4):2–15, December 2007.
Want to know more?
Network tomography
1. R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, vol. 19, no. 3, pp. 499–517, 2004.
2. A. Adams et al., “The Use of End-to-End Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, 2000.
3. N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.
4. R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
5. A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.
6. I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, IMC, 2009.
7. E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, and T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.
8. M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.
9. H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.
Want to know more?
Detailed diagnostics
1. S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, and P. Bahl, “Detailed diagnosis in enterprise networks”, in Proceedings of ACM SIGCOMM 2009.
2. A. A. Mahimkar, H. H. Song, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang, and J. Emmons, “Detecting the performance impact of upgrades in large operational networks”, in ACM SIGCOMM 2010.
3. D. Turner, K. Levchenko, A. C. Snoeren, and S. Savage, “California fault lines: understanding the causes and impact of network failures”, in ACM SIGCOMM 2010.
Anomaly detection
1. F. Silveira, C. Diot, N. Taft, and R. Govindan, “ASTUTE: detecting a different class of traffic anomalies”, in Proceedings of ACM SIGCOMM 2010.
2. Y. Zhang, M. Roughan, W. Willinger, and L. Qiu, “Spatio-temporal compressive sensing and internet traffic matrices”, in Proceedings of ACM SIGCOMM 2009.
3. A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions”, in Proceedings of ACM SIGCOMM 2005.
4. H. Ringberg, A. Soule, J. Rexford, and C. Diot, “Sensitivity of PCA for traffic anomaly detection”, in Proceedings of ACM SIGMETRICS 2007.