TRANSCRIPT
T-110.5116 Computer Networks II: Network diagnostics
07.11.2010 Matti Siekkinen
(Sources: R. Teixeira, “Internet measurements: fault detection, identification, and topology discovery”; S. Kandula, “Detailed Diagnosis in Enterprise Networks”)
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Network diagnostics
Understand how the network is doing
Detect and diagnose faults (links, routers, …)
Identify performance bottlenecks
o E.g. congested link
Detect and quarantine misbehaving devices or traffic
Anomaly detection
o E.g. misconfigured router, attacks
Why bother?
Keep things going
Stuff breaks down
Operators and admins are human beings and make mistakes
Want to keep the networks operational
Maximum benefit out of the infrastructure
Equipment costs money
Maximize utilization
Happy customers
Performance troubles make them unhappy
Unhappy customers may decrease revenues
Why is it challenging?
Few built-in diagnosis mechanisms
Today’s networks run on IP
Network elements are “simple”; intelligence lies at the edges
⇒ May need to use complex end-to-end methods to measure simple things (e.g. link capacity)
Scale can be very large
Traffic volumes
Number of nodes
Different services and protocols
⇒ Diagnosis techniques need to be scalable too
Diagnosing networks
Obtain some input data
SNMP traps, syslog msgs, trouble tickets, traffic traces, etc.
Inference / analysis
Analyze the input data, e.g. learn that a router link is down
Do something about it
E.g. start fixing the link
[Figure: three-step loop: collect raw measurements -> analyze measurements -> use learned information]
Ways to collect data for diagnosis
Management tools
Ask the devices how they are doing; receive alarms, traps
E.g. SNMP
Passive measurements
Simply record what you observe
E.g. Cisco’s NetFlow traffic data or raw traffic header traces
Active measurements
Send probes and observe what happens to them
E.g. tomography, bandwidth measurements
Where to collect measurements?
Network aggregation points
Router, switch
Access, gateway, backbone
Depends on scale, available methods, and objectives
Client or server
Usually limited possibilities
Possible in data center networks
[Figure: customers attach to access routers, which connect through gateway routers between ISP 1, ISP 2, and ISP 3, with backbone routers in the core]
Analyzing data
On-line
Perform (at least a part of) the analysis on the observed data in a real-time manner
☺ Data reduction -> don’t store everything
☺ Can react quickly
☹ Scalability
• An Internet2 backbone link (10 Gbit/s) produces >8 MB/s of uncompressed packet headers
• May need sampling, aggregation
☹ Do not necessarily have all the raw data for later analysis
Off-line
Record data into persistent storage and analyze later
☺ Run complex, time-consuming analysis
☹ Not for time-critical analysis
☹ Storage issues
Analyzing data (cont.)
Human vs. machine
Statistical analysis and data mining techniques
Reveal non-trivial patterns (aggregate/similar behavior, anomalies)
Still need an admin/operator somewhere in the loop
Combine many data sources
Increase robustness
o Fewer false positives
Detect issues that would normally “fly under the radar”
o Aggregated input feeds may reveal more
Analyzing data (cont.)
Why not just rely on alarms and traps (e.g. SNMP)?
Network “black holes”
o Silent failures: network devices do not send alarms
o Causes: complex cross-layer interactions, router software bugs/misconfigurations, …
Need for more detailed, application-specific diagnosis
Want insight into the causes of failures/problems that raise alarms
Diagnosis across administrative domains
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
A view from the edge
Measurements and diagnosis done from the end points
Not a strict classification of methodology…
Rely on
Probing messages analyzed at the other end
Clever use of standard protocols
o ping, traceroute
o Cannot use custom protocols since routers won’t react to and/or forward them
Three example cases
Topology discovery
Bandwidth measurements
Network tomography
o Topology information needed
Topology discovery
Topology describes how the network is laid out
Links between routers, switches, etc.
Not trivial knowledge in large-scale networks
Why do this?
Some diagnosis methods rely on accurate topology information
o Network tomography
Realistic simulation and modeling of the Internet
o Topology models needed for simulations
o Correctness of network protocols typically independent of topology
o Performance of networks critically dependent on topology
• e.g., convergence of route information
Measuring topology
Different vantage points
With access to routers (or “from inside”)
o Topology of one network
o Routing monitors (OSPF or IS-IS)
No access to routers (or “from outside”)
o Multi-AS topology or from end-hosts
o Monitors issue active probes: traceroute
Different granularity levels
Router-level topologies
o Reflect physical connectivity between nodes
o Inferred using e.g. traceroute
AS graphs
o Peering relationships between providers/clients
o Inferred from inter-domain routers’ BGP tables
o Could also use traceroute with some additional information
Topology from inside
Routing protocols flood the state of each link
Periodically refresh link state
Report any changes: link down, up, cost change
Monitor listens to link-state messages
Acts as a regular router
o AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
Combining link states gives the topology (see the sketch below)
Easy to maintain: messages report any changes
Usually not possible across domains
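To make the last step concrete, here is a minimal sketch that maintains a topology graph from link-state reports. It assumes a hypothetical feed of (router, neighbor, cost) tuples, such as one parsed from an IS-IS/OSPF monitor like PyRT; it is not the actual PyRT or OSPFmon API.

```python
# Sketch: maintain a topology graph from link-state advertisements.
# Assumes a hypothetical (router, neighbor, cost) feed, not a real monitor API.
import networkx as nx

def build_topology(lsas):
    """lsas: iterable of (router_id, neighbor_id, cost) link states."""
    g = nx.DiGraph()
    for router, neighbor, cost in lsas:
        g.add_edge(router, neighbor, cost=cost)
    return g

def apply_update(g, router, neighbor, cost=None, up=True):
    """Link-state messages report any change: link up, down, or cost change."""
    if up:
        g.add_edge(router, neighbor, cost=cost)
    elif g.has_edge(router, neighbor):
        g.remove_edge(router, neighbor)

g = build_topology([("r1", "r2", 10), ("r2", "r3", 5), ("r1", "r3", 20)])
apply_update(g, "r1", "r3", up=False)  # link r1 -> r3 went down
print(list(g.edges(data=True)))
```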
Inferring a path from outside: traceroute
[Figure: monitor m sends probes toward target t with TTL=1 and TTL=2; routers A and B answer “TTL exceeded” from interfaces A.1 and B.1, so the inferred path m -> A.1 -> B.1 -> t reflects the incoming interfaces of the actual path]
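The figure’s mechanics can be sketched in a few lines with Scapy (a packet-crafting library; sending raw packets requires root privileges, and the target below is a placeholder documentation address):

```python
# Sketch of traceroute's TTL-limited probing: each probe with TTL=k elicits
# an ICMP time-exceeded message from the k-th hop's incoming interface.
from scapy.all import IP, UDP, ICMP, sr1  # pip install scapy

def trace(dst, max_ttl=30):
    hops = []
    for ttl in range(1, max_ttl + 1):
        probe = IP(dst=dst, ttl=ttl) / UDP(dport=33434 + ttl)
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is None:
            hops.append("*")        # star: rate limiting or anonymous router
        elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
            hops.append(reply.src)  # TTL exceeded: an intermediate hop
        else:
            hops.append(reply.src)  # e.g. port unreachable: destination reached
            break
    return hops

print(trace("192.0.2.1"))  # placeholder target
```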
A traceroute path can be incomplete
Load balancing is widely used
Traceroute only probes one path
Sometimes traceroute has no answer (stars)
ICMP rate limiting for DoS protection
Anonymous routers
o Do not send ICMP replies at all, or reply with the probe’s destination IP
o Security and privacy concerns
Tunnelling (e.g., MPLS) may hide routers
Routers inside the tunnel may not decrement TTL
Traceroute under load balancing
[Figure: load balancer L forwards probes via either router B or router C on the way to D and E; because traceroute probes only one path, the inferred path has missing nodes and links and can even contain a false link between routers that are not directly connected]
Traceroute under load balancing (cont.)
Even per-flow load balancing causes trouble
Traceroute uses the destination port as identifier
Needs to match probe to response
Response only has the header of the issued probe
[Figure: classic traceroute varies the destination port per probe (port 2 for TTL=2, port 3 for TTL=3), so a per-flow load balancer hashes successive probes onto different paths]
Paris traceroute
Solves the problem with per-flow load balancing
Probes to a destination belong to the same flow
Keep flow IDs constant for probes to a specific destination
Flow ID = src/dst IP & port, transport protocol
How to match probes with ICMP responses? (see the sketch below)
Vary fields within the first eight octets of the transport-layer header (included in the ICMP response)
Keep the flow-ID-related fields constant
UDP probes: vary the checksum (need to manipulate the payload too)
ICMP probes: vary the sequence number, but also the Identifier -> keep the checksum constant
[Figure: Paris traceroute keeps port 1 for all probes and varies the checksum instead (checksum 2 for TTL=2, checksum 3 for TTL=3), so all probes follow the same path through load balancer L]
More details in: B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC 2006.
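The following is a hedged sketch of the Paris idea with Scapy, not the actual Paris traceroute implementation: the 5-tuple stays fixed (the port numbers below are arbitrary), only the payload varies so each probe gets a distinct UDP checksum, and the checksum quoted back inside the ICMP time-exceeded message identifies the probe.

```python
# Paris-style probing sketch: constant flow ID, match replies by checksum.
from scapy.all import IP, UDP, Raw, UDPerror, sr1

SPORT, DPORT = 33456, 33457  # fixed ports: part of the flow ID

def paris_probe(dst, ttl):
    pkt = IP(dst=dst, ttl=ttl) / UDP(sport=SPORT, dport=DPORT) / Raw(bytes([ttl]))
    return IP(bytes(pkt))    # rebuild so Scapy fills in the UDP checksum

def hop(dst, ttl):
    probe = paris_probe(dst, ttl)
    reply = sr1(probe, timeout=2, verbose=0)
    if reply is not None and reply.haslayer(UDPerror):
        # the quoted first 8 octets of the UDP header carry the checksum,
        # which tells us which probe triggered this reply
        if reply[UDPerror].chksum == probe[UDP].chksum:
            return reply.src
    return None
```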
More traceroute shortcomings
Inferred nodes = interfaces, not routers
Different interfaces have different IP addresses
Coverage depends on monitors and targets
Misses links and routers
Some links and routers appear multiple times
[Figure: an actual topology with routers A, B, C, D between monitors m1, m2 and targets t1, t2 is inferred as separate interface nodes A.1, B.3, C.1, C.2, D.1, so one router can appear several times and some links are missed]
Alias resolution: Map interfaces to routers
Direct probing (see the sketch below)
Responses from the same router will have close IP identifiers and the same TTL
N. Spring et al., “Measuring ISP Topologies with Rocketfuel”, SIGCOMM 2002.
Record-route IP option
Records up to nine IP addresses of routers in the path
Can also discover outgoing interfaces
R. Sherwood et al., “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM 2008.
[Figure: in the inferred topology, interfaces C.1 and C.2 turn out to be the same router]
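A sketch of the direct-probing test with Scapy; the max_gap threshold is an illustrative assumption, not a value from the Rocketfuel paper.

```python
# Alias resolution by direct probing: replies drawn from one router's shared
# IP-ID counter arrive with close identifiers and the same TTL.
from scapy.all import IP, UDP, sr1

def probe_ipid(addr):
    reply = sr1(IP(dst=addr) / UDP(dport=33434), timeout=2, verbose=0)
    return (reply.id, reply.ttl) if reply is not None else None

def likely_aliases(addr_a, addr_b, max_gap=10):
    a, b = probe_ipid(addr_a), probe_ipid(addr_b)
    if a is None or b is None:
        return False
    (id_a, ttl_a), (id_b, ttl_b) = a, b
    # second reply's ID should be slightly ahead of the first (mod 2^16)
    return ttl_a == ttl_b and (id_b - id_a) % 65536 < max_gap
```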
Large-scale topology measurements
Probing a large topology takes time
E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
Probing more targets covers more links
But getting a topology snapshot takes longer
Snapshot may be inaccurate
Paths may change during the snapshot
To know that a path changed, need to re-probe
Make it more feasible by reducing redundant probing (see the sketch below)
Aggregation points in network
B. Donnet et al., “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS 2005.
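The sketch below illustrates the stop-set idea behind Doubletree-style cooperative probing, a strong simplification of the algorithms in the cited paper: a trace toward a destination stops as soon as it reaches an (interface, destination) pair that some monitor has already probed.

```python
# Redundancy reduction with a shared stop set (simplified Doubletree idea).
def traced_hops(path, dest, stop_set):
    """path: the hop list a full trace toward dest would visit, in TTL order."""
    probed = []
    for hop in path:
        probed.append(hop)
        if (hop, dest) in stop_set:
            break                      # the rest of the path is already known
        stop_set.add((hop, dest))
    return probed

stop_set = set()
print(traced_hops(["A", "B", "C", "D"], "t1", stop_set))  # probes all 4 hops
print(traced_hops(["E", "B", "C", "D"], "t1", stop_set))  # stops at B
```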
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Bandwidth measurements
What?
Infer the bandwidth of a specific hop or of a whole path
Capacity = maximum possible throughput
Available bandwidth = portion of capacity not currently used
Bulk transfer capacity = throughput that a new single long-lived TCP connection could obtain
Why?
Network-aware applications
o Server or peer selection
o Route selection in overlay networks
QoS verification
Challenges
Routers and switches do not provide direct feedback to end-hosts (except ICMP, which is also of limited use)
Mostly due to scalability, policy, and simplicity reasons
Network administrators can read router/switch information using the SNMP protocol
End-to-end bandwidth estimation cannot be done that way
No access because of administrative barriers
The Internet is a “black box”
End-systems can infer network state through end-to-end (e2e) measurements
Without any feedback from routers
Objectives: accuracy, speed, minimal intrusiveness
[Figure: probing packets crossing the Internet, drawn as a black-box cloud between end-systems]
Metrics and definitions
[Figure: a simple example of an end-to-end path: source host -> link1 (access link) -> router1 -> link2 -> router2 -> link3 (access link) -> destination host, with cross traffic entering and leaving at both routers]
Metrics and definitions (cont.)
Capacity of this path is 100 Mbps
Determined by the narrow link
Available bandwidth of this path is 50 Mbps
Determined by the tight link

        link capacity   available bandwidth   used bandwidth
link1   100 Mbps        90 Mbps               10 Mbps          (narrow link)
link2   2500 Mbps       1300 Mbps             1200 Mbps
link3   1000 Mbps       50 Mbps               950 Mbps         (tight link)
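The path-level metrics follow from the per-link values by taking minima, as a tiny sketch over the numbers in the table shows:

```python
# Path capacity comes from the narrow link (minimum capacity); path available
# bandwidth comes from the tight link (minimum available bandwidth).
links = {  # Mbps: (capacity, available bandwidth)
    "link1": (100, 90),
    "link2": (2500, 1300),
    "link3": (1000, 50),
}
path_capacity = min(c for c, _ in links.values())   # 100 Mbps (narrow link)
path_available = min(a for _, a in links.values())  # 50 Mbps (tight link)
print(path_capacity, path_available)
```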
Measurement techniques
Generally use active probing
Send packets with a specific inter-arrival pattern
Observe the pattern at the other end
Example: packet-pair technique for capacity estimation
Send two equal-sized packets back-to-back
o Packet size: L
o Packet transmission time at link i: L/C_i
P-P dispersion: time interval between the first bits of the two packets
Without any cross traffic, the dispersion at the receiver is determined by the narrow link:
Δ_R = max_i (L / C_i) = L / min_i C_i = L / C, where C is the path capacity (see the sketch below)
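A minimal sketch of inverting the dispersion formula: the receiver timestamps the two back-to-back packets and divides the packet size by the measured dispersion.

```python
# Packet-pair capacity estimate: C = L / dispersion (no cross traffic assumed).
def packet_pair_capacity(arrival_first, arrival_second, packet_size_bytes):
    dispersion = arrival_second - arrival_first      # seconds
    return packet_size_bytes * 8 / dispersion        # bits per second

# e.g. 1500-byte packets arriving 120 microseconds apart -> 100 Mbit/s
print(packet_pair_capacity(0.0, 120e-6, 1500))
```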
Bandwidth estimation: CapProbe
R. Kapoor et al., “CapProbe: A Simple and Accurate Capacity Estimation Technique”, SIGCOMM 2004.
CapProbe is a capacity estimation tool
Takes into account the effect of cross-traffic
Cross-traffic packets can affect the P-P dispersion
P-P expansion: capacity underestimation
P-P compression: capacity overestimation
Noise in the P-P distribution depends on the cross-traffic load
CapProbe: Ideal Packet Dispersion
No cross-traffic
Capacity = (packet size) / (dispersion)
CapProbe: Expansion of Dispersion
Cross traffic (CT) serviced between the PP packets
Second packet queues due to cross traffic => expansion of dispersion => under-estimation
CapProbe: Compression of Dispersion
First packet queues => compressed dispersion => over-estimation
CapProbe: The approach
Observations:
First packet queues more than the second
o Compression
o Over-estimation
Second packet queues more than the first
o Expansion
o Under-estimation
Both expansion and compression are the result of probe packets experiencing queuing
o Sum of PP delays includes queuing delay
Filter out PP samples that do not have minimum queuing time
The dispersion of the PP sample with the minimum delay sum reflects the capacity
CapProbe Observation
For each packet pair, CapProbe calculates the delay sum: delay(packet_1) + delay(packet_2)
A PP with the minimum delay sum points out the capacity (see the sketch below)
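A sketch of CapProbe’s filtering rule over packet-pair samples (the sample values are illustrative, not measurements):

```python
# Keep the packet pair with the minimum delay sum, i.e. the pair that
# (most likely) experienced no queuing, and read capacity off its dispersion.
def capprobe_estimate(samples, packet_size_bytes):
    """samples: list of (delay1, delay2, dispersion) tuples, in seconds."""
    best = min(samples, key=lambda s: s[0] + s[1])   # minimum delay sum
    return packet_size_bytes * 8 / best[2]           # bits per second

samples = [
    (0.010, 0.012, 200e-6),  # expanded: second packet queued -> underestimate
    (0.013, 0.011, 80e-6),   # compressed: first packet queued -> overestimate
    (0.009, 0.009, 120e-6),  # no queuing: minimum delay sum
]
print(capprobe_estimate(samples, 1500))  # -> 100 Mbit/s
```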
Bandwidth estimation tools
Many estimation tools & techniques
Abing, netest, pipechar, STAB, pathneck, IGI/PTR, abget, Spruce, pathchar, clink, pchar, PPrate, DSLprobe, ABwProbe, …
Some practical issues
Traffic shapers
Non-FIFO queues
More scalable methods
Passive measurements instead of active measurements
o E.g. PPrate (2006) for capacity estimation: adapts Pathrate’s algorithm
One measurement host instead of two cooperating ones
o abget (2006) for available bandwidth estimation
o DSLprobe for capacity estimation of asymmetric (ADSL) links
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Different viewpoints of the network
Network operators only have data of one AS
AS4 doesn’t detect any problem
AS3 doesn’t know who is affected by the failure
End-hosts don’t know what happens in the network
Can only monitor end-to-end paths
[Figure: an end-to-end path crossing AS1, AS2, AS3, and AS4, with the failure inside AS3]
Network Tomography
View the network as a “black box”
Probe the network from the edge
Analogy to medical imaging with x-rays
Fault diagnosis with tomography
Diagnose persistent reachability problems across domains
Useful to detect black holes
o Silent failures that do not produce alerts!
Two-phase approach
Detect: end-to-end path monitoring
Localize: binary tomography
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
o Fault detection
o Fault localization
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Fault detection: end-to-end monitoring
What is monitored?
Different properties of network links
Loss rate, delay, bandwidth, connectivity
Who monitors?
Network operators
o In-network monitoring hosts
o Third-party monitoring services
o From home gateways
End users
o Cooperative monitoring
o Users of popular services/applications
Fault detection: end-to-end monitoring (cont.)
How is it monitored?
End-to-end: from probe senders to probe collectors
No access to routers
Using multicast probes would be efficient
o IP multicast not deployed in practice
[Figure: one probe sender fanning out probes to several probe collectors]
Monitoring techniques
Active probing: ping
Send probe, collect response
From any end host
o Works for network operators and end users
Passive analysis of user’s traffic
Tap incoming and outgoing traffic
o At user’s machines or servers: tcpdump, pcap
o Inside the network: DAG card
Monitor status of TCP connections
Fault detection with ping
Monitor m sends an ICMP echo request probe to target t, which answers with an ICMP echo reply (see the sketch below)
If m receives the reply, then the path is good
If no reply arrives before the timeout, then the path is bad
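A sketch of this test with Scapy (needs root; the target below is a placeholder documentation address):

```python
# Active fault detection: reply before the timeout -> good, silence -> bad.
from scapy.all import IP, ICMP, sr1

def path_is_good(target, timeout=2.0):
    reply = sr1(IP(dst=target) / ICMP(), timeout=timeout, verbose=0)
    return reply is not None and reply.haslayer(ICMP) and reply[ICMP].type == 0

print(path_is_good("192.0.2.1"))  # placeholder target
```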
Persistent failure or measurement noise?
Many reasons to lose a probe or a reply
Timeout may be too short
Rate limiting at routers
Some end-hosts don’t respond to ICMP requests
Transient congestion
Routing change
Need to confirm that a failure is persistent
Otherwise, may trigger false alarms
Failure confirmation
Upon detection of a failure, trigger extra probes
Two main parameters (see the sketch below)
Number of probes
o Too large a number perturbs the path and may cause additional losses
Time between probes
o Periodic? Which interval?
Parameter values determine the robustness and reactiveness
Make sure that we indeed observe a persistent failure
Do not spend too long confirming the failure
[Figure: a loss burst of packets on a path can look like a failure and produce a false positive]
More details in: I. Cunha et al., “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, IMC 2009.
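A sketch of the confirmation step with the two parameters above; probe_fn stands for any single-probe test, e.g. the ping check sketched earlier.

```python
# Confirm a suspected failure with n extra probes spaced `interval` apart;
# declare it persistent only if every confirmation probe is lost. Larger n and
# interval give robustness against loss bursts at the cost of reaction time.
import time

def confirm_failure(probe_fn, n_probes=5, interval=1.0):
    """probe_fn() returns True if the path answered."""
    for _ in range(n_probes):
        if probe_fn():
            return False   # transient loss, not a persistent failure
        time.sleep(interval)
    return True            # all confirmation probes lost
```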
Passive detection
At end hosts
tcpdump/pcap captures packets
Track the status of each TCP connection (see the sketch below)
o RTTs, timeouts, retransmissions
Multiple timeouts indicate the path is bad
Inside the network it is more challenging
Traffic volume is high
o Need special hardware
• DAG cards can capture packets at high speeds
o May lose packets
Tracking TCP connections is hard
o May not capture both sides of a connection
o Large processing and memory overhead
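A sketch of the end-host case using Scapy’s pcap reader: counting repeated data sequence numbers per flow gives a rough retransmission signal (real trackers also follow RTTs and timeouts; the trace filename is a placeholder).

```python
# Passive detection: flows with many retransmissions hint at a bad path.
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP

def retransmissions(pcap_file):
    seen = defaultdict(set)   # flow -> data sequence numbers already seen
    retx = defaultdict(int)   # flow -> retransmission count
    for pkt in rdpcap(pcap_file):
        if IP in pkt and TCP in pkt and len(pkt[TCP].payload) > 0:
            flow = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
            if pkt[TCP].seq in seen[flow]:
                retx[flow] += 1          # same data segment observed again
            seen[flow].add(pkt[TCP].seq)
    return retx

for flow, count in retransmissions("trace.pcap").items():  # placeholder file
    print(flow, count)
```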
Passive vs. active detection
Passive
+ No need to inject traffic
+ Detects all failures that affect user’s traffic
+ Responses from targets that don’t respond to ping
‒ Not always possible to tap user’s traffic
‒ Only detects failures in paths with traffic
Active
+ No need to tap user’s traffic
+ Detects failures in any desired path
+ Fast detection
‒ Probing overhead
‒ Deployment overhead
Aggregation: motivation
Lack of synchronization leads to inconsistencies
Probes cross links at different times
Path may change between probes
Measurements from a single monitor
Probing all targets can take time
Measurements from multiple monitors
Hard to synchronize monitors for all probes to reach a link at the same time
Impossible to generalize to all links
[Figure: monitor m probing targets t1 and t2 at different times can mistakenly infer a failure]
Aggregation: motivation (cont.)
[Figure: a path reachability matrix for monitors m1…mK and targets t1…tN; some entries are good, others bad, e.g. (m1, t1) = good but (mK, t1) = bad, illustrating inconsistent measurements]
Aggregation: Basic idea
Reprobe paths after a failure
Consistency has a cost
Delays fault localization
Cannot identify short failures
[Figure: the same reachability matrix after re-probing, now with consistent good/bad entries per path]
Aggregation
Aggregation = building a reachability matrix out of path measurements
Three strategies, each triggered by a path status change
BASIC
o Build the reachability matrix after the next full monitoring cycle
o Consistency problems remain
• e.g. frequent detection errors
MC (multi-cycle)
o Wait n cycles until measurements are identical
o High consistency at the expense of increased delay
MC-PATH (multi-cycle noise tolerant)
o Strikes a balance between BASIC and MC
o Fix n and do not require identical status for all paths
o Unstable paths marked as “up”
Aggregation: tradeoff
Consistency vs. localization speed
Faster localization leads to false alarms
Slower localization misses short failures
Network operators
Too many false alarms are unmanageable
Longer failures are the ones that need intervention
End users
Even short failures affect performance
More details in: I. Cunha et al., “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, IMC 2009.
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
o Fault detection
o Fault localization
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Fault localization: Binary tomography
Labels paths as good or bad
Loss-rate estimation requires tight correlation
Instead, separate good from bad performance
If a link is bad, all paths that cross the link are bad
Find the smallest set of links that explains the bad paths
Given that bad links are uncommon
A bad link is the root of a maximal bad subtree
[Figure: probing tree from monitor m to targets t1 and t2; the bad link is the root of the subtree whose paths are all bad]
Binary tomography in practice
Multiple sources and targets
Problem becomes NP-hard
Minimum hitting set problem
o Hitting set of a link = paths that traverse the link
Iterative greedy algorithm
Given the set of links in bad paths
Iteratively choose the link that explains the max number of bad paths
Further issues
Topology may be unknown
o Need to measure accurate topology
Multicast not available
o Need to extract correlation from unicast probes
[Figure: two monitors m1 and m2 probing targets t1 and t2 over a shared tree]
Greedy approximation
Method based on spatial correlation
Intersect the sets of OD-pairs that experienced failures to discover shared links
Shared links are the most likely explanation -> hypothesis
Apply a greedy approximation approach
Can have single, dual, or multiple failures
Infeasible to explore all possibilities
Just look for the most likely explanation
[Figure: intersecting the failed OD-pairs leaves {G-H} as the hypothesis for the black hole]
Greedy approximation (cont.)
Failure detection generates a failure signature
Set of lost probes for all OD-pairs
A.k.a. reachability matrix
MAX-COVERAGE algorithm (see the sketch below)
Pick the link that explains the largest number of observations
Add the link to the hypothesis
Remove the corresponding observations from the failure signature
Repeat iteratively until no observations remain in the failure signature
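A sketch of the MAX-COVERAGE greedy loop; the mapping from links to the OD-pair observations they can explain comes from the measured topology.

```python
# Greedy set cover over the failure signature: repeatedly pick the link that
# explains the most remaining failed OD-pairs.
def max_coverage(failure_signature, link_to_odpairs):
    remaining = set(failure_signature)
    hypothesis = []
    while remaining:
        link = max(link_to_odpairs,
                   key=lambda l: len(link_to_odpairs[l] & remaining))
        covered = link_to_odpairs[link] & remaining
        if not covered:
            break              # leftover observations cannot be explained
        hypothesis.append(link)
        remaining -= covered
    return hypothesis

# toy example: "G-H" explains both failed OD-pairs, so it is picked first
links = {"A-B": {("m1", "t1")}, "G-H": {("m1", "t1"), ("m2", "t2")}}
print(max_coverage({("m1", "t1"), ("m2", "t2")}, links))  # -> ['G-H']
```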
Fault localization: greedy algorithm
How to select the candidate link within the hypothesis?
ABSOLUTE: pick links that explain at least n observations
RELATIVE: pick links that explain at least a fraction f of the observations
Pros & cons
Avoids having to explore all possibilities
Can localize multiple failures
Bias in favor of links present in many paths
More details in: R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
View from the inside
Need for more detailed diagnostics
Tomography is too coarse-grained
Basic SNMP-style monitoring does not reveal root causes
Trade off some scale for detail
Focus on (small-scale) enterprise networks
Example case: a diagnosis system called NetMedic
Small enterprise network troubleshooting
Requirements
Handle app-specific as well as generic faults
Identify culprits at a fine granularity
Small enterprises have…
less sophisticated admins
less rich connectivity
many shared components
Scale is not huge -> can afford detailed diagnosis
Tradeoff between detail and scale
Requirements for dependency-graph-based formulations
Model the network as a dependency graph at a fine-grained level
o Not just machine-level
o Not just faulty or healthy
Complex dependency model
o Allow circular and direct/indirect mutual dependencies
Trade off some scalability
Example problem 1: Server misconfig
A software update had changed the Web server config
Incorrect processing of some scripts
Operator aware of the update, not of the config change
[Figure: browsers talking to a Web server; the server config is the culprit]
Example problem 2: Buggy client
C1 experienced slow performance from the SQL server due to misbehaving client C2
[Figure: SQL clients C1 and C2 sending requests to the same SQL server]
Example problem 3: Client misconfig
Client config was overridden with an incorrect mail server type
Bug in client software triggered by an overnight update
[Figure: Outlook clients with their configs talking to an Exchange server]
A formulation for detailed diagnosis
Challenge: application-specific diagnosis in an application-agnostic way
Point solutions are easy but infeasible
Dependency graph of fine-grained components
Process, OS, configs, network path, …
A process depends on its OS, a configuration on the application it is running
Snoop socket-level events to connect communicating processes
Component state is a multi-dimensional vector
Application-dependent and application-independent variables
Diagnostic system unaware of semantics
Variables exposed by OS and apps
o E.g. Windows performance counters, /proc/
[Figure: dependency graph of process, OS, and config components for SQL clients C1 and C2, an SQL server, an Exchange server, and an IIS server with its IIS config; example state variables: % CPU time, IO bytes/sec, connections/sec, 404 errors/sec]
The goal of diagnosis
Identify likely culprits for components of interest
Without using semantics of state variables
No application knowledge
[Figure: dependency graph of a server Svr and clients C1 and C2, each modeled as process, OS, and config components]
Using joint historical behavior to estimate impact
Identify time periods when the state of a source component S was “similar” to its current state
Estimate how “similar”, on average, the states of a destination component D were at those times
If S is in a previously unseen state, conservatively estimate the impact to be high (see the sketch below)
[Figure: state-vector histories of D and S over time; e.g. the server’s history contains periods with (request rate high, response time high) and (request rate low, response time high) that are matched against the current state]
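A numpy sketch of this estimation, a simplification rather than NetMedic’s actual algorithm (the similarity radius and the per-variable normalization are invented for illustration):

```python
# Edge weight from joint history: average abnormality of destination D over
# the past time bins in which source S looked like it does now; an unseen
# state of S conservatively maps to high impact.
import numpy as np

def edge_weight(S_hist, D_abnormality, s_now, radius=0.2):
    """S_hist: (T, k) past state vectors of S, each variable scaled to [0, 1];
    D_abnormality: (T,) abnormality score of D per bin; s_now: (k,) state."""
    dist = np.linalg.norm(S_hist - s_now, axis=1) / np.sqrt(S_hist.shape[1])
    similar = dist < radius
    if not similar.any():
        return 1.0  # previously unseen state: assume high impact
    return float(D_abnormality[similar].mean())

rng = np.random.default_rng(0)
S_hist = rng.random((100, 3))   # toy history of S
D_abn = rng.random(100)         # toy abnormality of D
print(edge_weight(S_hist, D_abn, S_hist[10]))
```

Culprits are then ranked along the dependency graph using these edge weights, e.g. via the geometric mean of the weights on a path (see the next slide).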
Further details
Ranking of likely culprits
Mainly using the dependency graph’s path weight
o Geometric mean of edge weights
Evaluated with a quite small setup
Works, but scalability to large environments may be an issue
NetMedic needs ~60 minutes of history
Two sub-problems at the intersection with HCI
Visualizing complex analysis (NetClinic)
Intuitiveness of analysis
More details in: S. Kandula et al., “Detailed diagnosis in enterprise networks”, SIGCOMM 2009. (Microsoft Research)
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Anomaly detection
Study abnormal traffic
Non-productive traffic, a.k.a. Internet “background radiation”
Traffic that is malicious (scans for vulnerabilities, worms) or mostly harmless (misconfigured devices)
Network troubleshooting
Identify and locate misconfigured or compromised devices
Intrusion detection
Identify malicious activity before it hits you
Analyze traffic for attack signatures
Characterizing malicious activities
Honeypot: a host that sits and waits for attacks
o Learn how attackers probe for and exploit a system
Network telescope: a portion of routed IP address space on which little or no legitimate traffic exists
Anomaly detection techniques
Typical procedure
Learn underlying models of normal traffic
o May require a long time (days or weeks)
Detect deviations from the normal
Several techniques exist, based on statistical analysis
PCA: Principal Component Analysis (see the sketch below)
Kalman filter
Wavelets
…
Some techniques require little or no training
ASTUTE
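As one concrete instance of the statistical techniques listed above, here is a toy PCA-based detector in the spirit of the subspace method (a sketch, not the full method from the literature): learn the dominant “normal” subspace from a training window of link measurements and flag time bins with a large residual.

```python
# PCA anomaly detection sketch: the residual is the part of a measurement
# vector that falls outside the learned "normal traffic" subspace.
import numpy as np

def fit_normal_subspace(X_train, k=3):
    """X_train: (T, n_links) traffic matrix; returns the mean and top-k axes."""
    Xc = X_train - X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return X_train.mean(axis=0), Vt[:k]

def residual_norm(x, mean, V):
    xc = x - mean
    return np.linalg.norm(xc - V.T @ (V @ xc))  # energy outside the subspace

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                      # toy training traffic
mean, V = fit_normal_subspace(X)
threshold = np.percentile([residual_norm(x, mean, V) for x in X], 99)
spike = X[0] + 10 * rng.normal(size=20)             # injected anomaly
print(residual_norm(spike, mean, V) > threshold)    # True: flagged
```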
Outline
Network diagnostics: what, why, and how?
A view from the edge: topology discovery, bandwidth measurements, network tomography
A view from the inside: detailed diagnostics and root cause analysis, traffic anomaly detection
Conclusions
Wrapping up
Diagnosing networks may be quite challenging
Complexity
Scale
Network elements are dumb
Hence, solutions can be quite sophisticated
Simply analyzing alarms and triggers may be time-consuming and lets faults “fly under the radar”
o False positives, black holes, complex dependencies, …
Statistical analysis, data mining
Combine many sources of data, even of “low” quality
o Whatever is available
Two viewpoints:
From the edge: tomography
o Controlled probing of the network
From the inside
o Collect (passively) and analyze as much data as you can
o Anomaly detection
Want to know more?
Bandwidth measurements
1. V. Jacobson, “pathchar: A Tool to Infer Characteristics of Internet Paths”, ftp://ftp.ee.lbl.gov/pathchar/, Apr. 1997.
2. A. B. Downey, “Using Pathchar to Estimate Internet Link Characteristics”, in Proceedings of ACM SIGCOMM, Sept. 1999, pp. 222–223.
3. K. Lai and M. Baker, “Measuring Link Bandwidths Using a Deterministic Model of Packet Delay”, in Proceedings of ACM SIGCOMM, Sept. 2000, pp. 283–294.
4. A. Pasztor and D. Veitch, “Active Probing using Packet Quartets”, in Proceedings of the Internet Measurement Workshop (IMW), 2002.
5. R. S. Prasad, M. Murray, C. Dovrolis, and K. Claffy, “Bandwidth estimation: metrics, measurement techniques, and tools”, IEEE Network, November/December 2003.
6. S. Keshav, “A Control-Theoretic Approach to Flow Control”, in Proceedings of ACM SIGCOMM, Sept. 1991, pp. 3–15.
7. R. L. Carter and M. E. Crovella, “Measuring Bottleneck Link Speed in Packet-Switched Networks”, Performance Evaluation, vol. 27–28, pp. 297–318, 1996.
8. V. Paxson, “End-to-End Internet Packet Dynamics”, IEEE/ACM Transactions on Networking, vol. 7, no. 3, pp. 277–292, June 1999.
9. M. Jain and C. Dovrolis, “End-to-End Available Bandwidth: Measurement Methodology, Dynamics, and Relation with TCP Throughput”, in Proceedings of ACM SIGCOMM, Aug. 2002, pp. 295–308.
10. B. Melander, M. Bjorkman, and P. Gunningberg, “Regression-Based Available Bandwidth Measurements”, in International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2002.
11. V. Ribeiro, R. Riedi, R. Baraniuk, J. Navratil, and L. Cottrell, “pathChirp: Efficient Available Bandwidth Estimation for Network Paths”, in Proceedings of the Passive and Active Measurement (PAM) workshop, Apr. 2003.
12. N. Hu and P. Steenkiste, “Evaluation and Characterization of Available Bandwidth Probing Techniques”, IEEE Journal on Selected Areas in Communications, 2003.
13. K. Harfoush, A. Bestavros, and J. Byers, “Measuring Bottleneck Bandwidth of Targeted Path Segments”, in Proceedings of IEEE INFOCOM, 2003.
14. M. Allman, “Measuring End-to-End Bulk Transfer Capacity”, in Proceedings of the ACM SIGCOMM Internet Measurement Workshop, Nov. 2001, pp. 139–143.
15. L. Lao, M. Y. Sanadidi, and C. Dovrolis, “The Probe Gap Model can Underestimate the Available Bandwidth of Multihop Paths”, ACM SIGCOMM Computer Communication Review, October 2006.
16. D. Antoniades, M. Athanatos, A. Papadogiannakis, E. P. Markatos, and C. Dovrolis, “Available bandwidth measurement as simple as running wget”, Passive and Active Measurement (PAM) conference, March 2006.
17. R. Kapoor et al., “CapProbe: A Simple and Accurate Capacity Estimation Technique”, SIGCOMM 2004.
18. D. Croce, T. En-Najjary, G. Urvoy-Keller, and E. Biersack, “Capacity Estimation of ADSL links”, in Proceedings of the 4th ACM CoNEXT conference, December 2008.
Want to know more?
Topology discovery
1. P. Marchetta, P. Mérindol, B. Donnet, A. Pescapé, and J.-J. Pansiot, “Topology Discovery at the Router Level: A New Hybrid Tool Targeting ISP Networks”, IEEE Journal on Selected Areas in Communications, Special Issue on Measurement of Internet Topologies, 2011 (to appear).
2. J.-J. Pansiot, P. Mérindol, B. Donnet, and O. Bonaventure, “Extracting Intra-Domain Topology from mrinfo Probing”, in Proceedings of the Passive and Active Measurement Conference (PAM), pp. 81–90, April 2010.
3. B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Deployment of an Algorithm for Large-Scale Topology Discovery”, IEEE Journal on Selected Areas in Communications, Sampling the Internet: Techniques and Applications, 24(12):2210–2220, Dec. 2006.
4. Y. Bejerano, “Taking the Skeletons Out of the Closets: A Simple and Efficient Topology Discovery Scheme for Large Ethernet”, Proc. IEEE INFOCOM, Apr. 2006.
5. Z. M. Mao et al., “Scalable and Accurate Identification of AS-level Forwarding Paths”, Proc. IEEE INFOCOM, Mar. 2004.
6. P. Mahadevan et al., “The Internet AS-Level Topology: Three Data Sources and One Definitive Metric”, ACM SIGCOMM Computer Communication Review, vol. 36, no. 1, Jan. 2006, pp. 17–26.
7. B. Donnet and T. Friedman, “Internet Topology Discovery: a Survey”, IEEE Communications Surveys and Tutorials, 9(4):2–15, December 2007.
Want to know more?
Network tomography
1. R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, vol. 19, no. 3, pp. 499–517, 2004.
2. A. Adams et al., “The Use of End-to-End Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, 2000.
3. N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.
4. R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
5. A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.
6. I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, IMC, 2009.
7. E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, and T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.
8. M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.
9. H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.
Want to know more?
Detailed diagnostics
1. S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, and P. Bahl, “Detailed diagnosis in enterprise networks”, in Proceedings of ACM SIGCOMM 2009.
2. A. A. Mahimkar, H. H. Song, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang, and J. Emmons, “Detecting the performance impact of upgrades in large operational networks”, in ACM SIGCOMM 2010.
3. D. Turner, K. Levchenko, A. C. Snoeren, and S. Savage, “California fault lines: understanding the causes and impact of network failures”, in ACM SIGCOMM 2010.
Anomaly detection
1. F. Silveira, C. Diot, N. Taft, and R. Govindan, “ASTUTE: detecting a different class of traffic anomalies”, in Proceedings of ACM SIGCOMM 2010.
2. Y. Zhang, M. Roughan, W. Willinger, and L. Qiu, “Spatio-temporal compressive sensing and internet traffic matrices”, in Proceedings of ACM SIGCOMM 2009.
3. A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions”, in Proceedings of ACM SIGCOMM 2005.
4. H. Ringberg, A. Soule, J. Rexford, and C. Diot, “Sensitivity of PCA for traffic anomaly detection”, in Proceedings of ACM SIGMETRICS 2007.