T-110.5116 Computer Networks II Network diagnostics and traffic analysis 12/19.11.2012 Matti Siekkinen
(Sources: R.Teixeira: Internet measurements: fault detection, identification, and topology discovery; S. Kandula: Detailed Diagnosis in Enterprise Networks)
Concerning exam dates
• Network security exam on the same date: December 17 – Different time though…
• Additional exam date now available: January 3, 2013
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
What is Quality of Service?
• Many applications are sensitive to delay, jitter, and packet loss – Too-high values make the utility drop to zero
• Some mission-critical applications cannot tolerate disruption – VoIP – high-availability computing
• Related concept is service availability – How likely is it that I can place a call and not get interrupted? – requires meeting the QoS requirements for the given application
Example QoS Requirements
[Figure: example applications plotted by delay sensitivity (sensitive ↔ insensitive) against mission criticality (casual ↔ critical). Examples include personal voice over IP, network monitoring, CEO video conference with analysis, financial transactions, interactive whiteboard, unicast radio, network management traffic, extranet web traffic, public web traffic, push news, personal e-mail, business e-mail, and server backups.]
How to guarantee QoS
• Provisioning (before any data packets sent)
  – Admission control
    • Prohibit or allow new flows to enter the network
    • Make sure we have the necessary available bandwidth in the network
  – Resource reservation
    • Reserve the necessary available bandwidth in the network
• Control (during data transfer)
  – Scheduling (FIFO, WFQ)
    • Which flow gets a piece of the resources at a given time instant
  – Queue mgmt (drop-tail, RED)
    • If the buffer fills up, which flow do we punish?
  – Policing (leaky/token bucket)
    • Enforce flows to behave according to the agreed policy
    • E.g. send traffic at constant rate R
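The token-bucket policing idea above can be sketched in a few lines of Python (a simplified illustration, not any particular router's implementation; the class name and parameters are ours): tokens accumulate at rate R up to a burst size, and a packet conforms only if enough tokens are available.

```python
class TokenBucket:
    """Token-bucket policer: admits bursts up to `burst` bytes
    and a sustained rate of `rate` bytes per second."""

    def __init__(self, rate, burst):
        self.rate = rate      # token fill rate (bytes/s)
        self.burst = burst    # bucket depth (bytes)
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of last update (s)

    def conforms(self, now, pkt_len):
        """Return True if a packet of pkt_len bytes arriving at time
        `now` conforms to the policy (consuming tokens if it does)."""
        # Refill tokens for the elapsed time, capped at the bucket depth
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_len:
            self.tokens -= pkt_len
            return True
        return False   # non-conforming: drop or mark the packet
```

With rate 1000 B/s and burst 1500 B, the bucket admits an initial 1500-byte burst and thereafter roughly 1000 bytes per second.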
How to guarantee QoS (cont.)
[Figure: QoS mechanisms placed at the network edge and core — packet scheduling (users get their share of bandwidth), traffic shaping (policing to control the amount of traffic users can inject into the network), and admission control (to accept or reject a flow based on flow specifications).]
How to guarantee QoS (cont.)
• These are network-layer techniques – Each router needs to support this – Together allow perfect control of QoS
• The Internet does not implement these mechanisms – Works today only within a single ISP's network
• Technically we know how to do it Internet wide but other reasons prevent deployment – E.g. lack of business models
What if we had QoS guarantees?
• Your Internet subscription states the SLA
  – Describes what kind of service you will get
  – E.g. guaranteed bandwidth of B with max delay of D when there are no higher-priority customers present
• How would you perceive that?
  – YouTube video would either stream perfectly or might not load at all
    • May be no admission with your SLA at the moment
  – Skype call would never be of bad quality, but the call can be refused or interrupted
  – Downloading a file of size S takes exactly S/B seconds
  – Obviously, assuming the network is not broken and you have (wireless) coverage…
QoS in Today’s Internet
• TCP/UDP/IP: “best-effort service”
  – No guarantees on delay or loss
• Today’s Internet applications use application-level techniques to mitigate (as best possible) the effects of delay and loss
• But some apps (multimedia) require QoS and a level of performance to be effective!
What can be done today to control QoS
• Mainly application-level techniques
  – Application adapts to network conditions
    • Buffer stream data, conceal errors, …
  – Use overlay networks
  – No need to change anything in routers
• Make the best out of the best-effort network
  – Cannot guarantee anything
• No guarantees means we cannot be sure what kind of QoS we get
  – Monitoring is important
  – Enter network diagnostics and traffic analysis…
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Network diagnostics and traffic analysis
• Understand how the network is doing – Detect and diagnose faults (links, routers, …) – Identify performance bottlenecks
• E.g. congested link
• Detect and quarantine misbehaving devices or traffic – Anomaly detection – E.g. misconfigured router, attacks
• Learn what kind of QoS users perceive – Performance evaluation of applications – Analyze resulting traffic to infer perceived QoS – Goal is obviously to improve if possible
Why bother?
• Keep things going – Stuff breaks down – Operators and admins are human beings and make mistakes – Want to keep the networks operational
• Maximum benefit out of the infrastructure – Equipment costs money – Maximize utilization
• Happy customers – Performance troubles make them unhappy – Unhappy customers may decrease revenues
Why is it challenging?
• Few built-in diagnosis mechanisms
  – Today’s networks run on IP
  – Network elements are “simple”
  – Intelligence lies at the edges
  ⇒ May need to use complex end-to-end methods to measure simple things (e.g. link capacity)
• Scale can be very large – Traffic volumes – Number of nodes – Different services and protocols ⇒ Diagnosis techniques need to be scalable too
Diagnosing networks
• Obtain some input data
  – SNMP traps, syslog messages, trouble tickets, traffic traces, etc.
• Inference / analysis
  – Analyze the input data
  – E.g. learn that a router link is down
• Do something about it
  – E.g. start fixing the link
[Figure: Collect raw measurements → Analyze measurements → Use learned information]
Ways to collect data for diagnosis
• Management tools – Ask the devices how they are doing – Receive alarms, traps – E.g. SNMP
• Passive measurements – Simply record what you observe – E.g. Cisco’s Netflow traffic data or raw traffic header traces
• Active measurements – Send probes and observe what happens to them – E.g. tomography, bandwidth measurements
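The passive-measurement style of Cisco's NetFlow can be illustrated with a toy aggregator (a sketch only; real NetFlow records export many more fields, and the packet tuples below are made up): observed packets are grouped by their 5-tuple into flow records with packet and byte counts.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets by 5-tuple (src IP, dst IP, src port, dst port,
    protocol) into NetFlow-style records: [packet count, byte count]."""
    flows = defaultdict(lambda: [0, 0])
    for src, dst, sport, dport, proto, length in packets:
        rec = flows[(src, dst, sport, dport, proto)]
        rec[0] += 1        # packets in this flow
        rec[1] += length   # bytes in this flow
    return dict(flows)

# Hypothetical observed packets: (src, dst, sport, dport, proto, bytes)
packets = [
    ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 1500),
    ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 40),
    ("10.0.0.3", "10.0.0.2", 5353, 53, "UDP", 80),
]
flows = aggregate_flows(packets)
# Two flows: the first with 2 packets / 1540 bytes, the second with 1 / 80
```

This kind of aggregation is exactly the data reduction that makes passive monitoring scale: the monitor stores per-flow counters instead of every packet.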
Where to collect measurements?
• Network aggregation points
  – Router, switch
  – Access, gateway, backbone
  – Depends on scale, available methods, and objectives
• Client or server – Usually limited possibilities – Possible in data center networks
[Figure: ISP network with access routers serving customers, backbone routers, and a gateway router connecting to other ISPs (ISP 1, ISP 2, ISP 3).]
Analyzing data
• On-line
  – Perform (at least a part of) the analysis on the observed data in a real-time manner
  ☺ Data reduction → don’t store everything
  ☺ Can react quickly
  ☹ Scalability
    • A 10 Gbit/s link produces >8 MB/s of uncompressed packet headers
    • May need sampling, aggregation
  ☹ Do not necessarily have all the raw data for later analysis
• Off-line
  – Record data into persistent storage and analyze later
  ☺ Run complex, time-consuming analysis
  ☹ Not for time-critical analysis
  ☹ Storage issues
Analyzing data (cont.)
• Human vs. machine
  – Statistical analysis and data mining techniques
  – Reveal non-trivial patterns (aggregate/similar behavior, anomalies)
  – Still need an admin/operator somewhere in the loop
• Combine many data sources
  – Increase robustness
    • Fewer false positives
  – Detect issues that would normally “fly under the radar”
    • Aggregated input feeds may reveal more
Analyzing data (cont.)
• Wait a minute, we already have SNMP!
  – E.g. routers can produce traps when something goes wrong
• Alarms and traps from devices are not enough even for one network
  – Network “black holes”
    • Silent failures: network devices do not send alarms
    • Causes: complex cross-layer interactions, router software bugs/misconfigurations, …
  – Need detailed and application-specific diagnosis
    • Want to know the causes of failures/problems that raise alarms
  – End-to-end diagnosis across administrative domains
    • You cannot make an SNMP query to a router in an Australian ISP’s network
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Measuring networks
• Measurements and diagnosis of network properties – Bandwidth, delay, connectivity, reachability…
• How?
  – Active measurements
    • Probing messages analyzed at the other end
    • Clever use of standard protocols: ping, traceroute
  – Passively collected data (e.g. routing logs)
• Three example cases – Topology discovery – Bandwidth measurements – Network tomography
Topology
• What’s topology?
  – Topology describes how the network is laid out
    • Links between routers, switches, etc.
    • Not trivial knowledge in large-scale networks
• What’s the Internet topology like?
  – The Internet consists of Autonomous Systems (ASes)
    • “a connected group of one or more IP prefixes run by one or more network operators which has a single and clearly defined routing policy” [RFC 1930]
    • E.g. an Internet Service Provider (ISP)
  – The Internet has a two-level topology
    • Intra-domain topology: within a single network (AS)
    • Inter-domain topology: across ASes
Internet topology: illustration
[Figure]
Internet topology (cont.)
• Internet service providers (ISPs) grouped in classes
  – Tier 1: global
    • 10–15 of them
    • The Internet’s “backbone”
    • Settlement-free peering: carry each other’s traffic without charges
  – Tier 2: regional
    • Both peering and transit services
  – Tier 3: local
    • Solely transit (buy connectivity from higher-tier ISPs)
Internet topology in 2008
[Figure: AS-level topology map. A few tens of thousands of ASes; size varies.]
Topology discovery
• Find out the topology of a given network by
  – probing (active measurements)
  – analyzing logs and/or traffic (passive measurements)
• Why is it useful?
  – Some diagnosis methods rely on accurate topology information
    • E.g. network tomography needs topology
  – Realistic simulation and modeling of the Internet
    • Topology models needed for simulations
    • E.g. performance of routing protocols is critically dependent on topology
Topology discovery (cont.)
• Granularity level
  – Router-level topologies
    • Reflect physical connectivity between nodes
    • Inferred using e.g. traceroute
  – AS graphs
    • Peering relationships between providers/clients
    • Inferred from inter-domain routers’ BGP tables
    • Could also use traceroute with some additional information
• Measurement location
  – With access to routers (“from inside”)
    • Topology of one network
    • Routing monitors (OSPF or IS-IS)
  – No access to routers (“from outside”)
    • Multi-AS topology or from end-hosts
    • Monitors issue active probes: traceroute
Topology from inside
• Routing protocols flood state of each link – Periodically refresh link state – Report any changes: link down, up, cost change
• Monitor listens to link-state messages – Acts as a regular router
• AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
• Combining link states gives the topology – Easy to maintain, messages report any changes
• Usually not possible across domains
Inferring a path from outside: traceroute
[Figure: monitor m probes target t through routers A and B. The TTL=1 probe triggers “TTL exceeded” from interface A.1; the TTL=2 probe triggers “TTL exceeded” from B.1. The actual path traverses interfaces A.1/A.2 and B.2/B.1; the inferred path is A.1 → B.1.]
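The TTL mechanism behind traceroute can be illustrated with a small simulation in Python (a sketch: no real probes are sent, and the interface names match the figure above): each router decrements the TTL, and the router at which it reaches zero reports its incoming interface.

```python
def simulated_traceroute(path):
    """Infer a path the way traceroute does: for increasing TTL values,
    record the interface of the router at which the TTL expires.
    `path` lists the router interfaces between monitor and target."""
    inferred = []
    for ttl in range(1, len(path) + 1):
        hops_left = ttl
        for hop in path:
            hops_left -= 1          # each router decrements the TTL
            if hops_left == 0:      # TTL exceeded: this router replies
                inferred.append(hop)
                break
    return inferred

# Probing through interfaces A.1 and B.1, as in the figure:
print(simulated_traceroute(["A.1", "B.1"]))  # → ['A.1', 'B.1']
```

Note that each TTL value is a separate probe: the monitor reconstructs the whole path only by sending one probe per hop, which is why path changes and load balancing (next slides) can confuse it.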
A traceroute path can be incomplete
• Load balancing is widely used
  – Forward packets differently based on load in different parts of the network
  – Can be per-flow or even per-packet
  – Traceroute only probes one path
• Sometimes traceroute has no answer (stars)
  – ICMP rate limiting for DoS protection
  – Anonymous routers
    • Do not send ICMP replies at all, or reply with the probe’s destination IP
    • Security and privacy concerns
• Tunnelling (e.g., MPLS) may hide routers
  – Routers inside the tunnel may not decrement TTL
Traceroute under load balancing
[Figure: monitor m probes target t through load balancer L, which spreads successive probes (TTL=2, TTL=3) over two parallel paths via routers A, B, C, D, E. The inferred path therefore has missing nodes and links and contains a false link between routers that are not directly connected.]
Traceroute under load balancing (cont.)
• Even per-flow load balancing causes trouble
  – Traceroute uses the destination port as probe identifier
    • Needs to match each probe to its response
    • The response only contains the header of the issued probe
  – Varying the port changes the flow ID, so successive probes may be balanced onto different paths
[Figure: classic traceroute sends the TTL=2 probe with destination port 2 and the TTL=3 probe with port 3; the load balancer L hashes the differing flow IDs onto different paths toward target t.]
Paris traceroute
• Solves the problem with per-flow load balancing
  – Probes to a destination belong to the same flow
    • Keep the flow ID constant for probes to a specific destination
    • Flow ID = src/dst IP & port, transport protocol
• How to match probes with ICMP responses?
  – Need to know which ICMP response corresponds to which probe
Paris traceroute
• Matching probes with ICMP responses
  – Vary fields within the first eight octets of the transport-layer header (included in the ICMP response)
  – Keep the flow-ID-related fields constant
  – UDP probes: vary the checksum (need to manipulate the payload too)
  – ICMP probes: vary the sequence number and Identifier, but keep the checksum constant
[Figure: Paris traceroute keeps port 1 (the flow ID) constant for all probes and instead varies the checksum (2, 3, …) to match responses; the load balancer L keeps all probes on the same path.]
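The difference between classic and Paris traceroute under per-flow load balancing can be sketched with a toy balancer (illustrative only: Python's `hash()` stands in for the router's hash function, and the paths and port numbers are invented):

```python
def pick_path(src, dst, sport, dport, proto, paths):
    """Per-flow load balancer: hash the flow ID (5-tuple) to pick a path."""
    flow_id = (src, dst, sport, dport, proto)
    return paths[hash(flow_id) % len(paths)]

paths = [("L", "A", "B"), ("L", "C", "D")]

# Classic traceroute varies the destination port per probe,
# so probes may be hashed onto different paths:
classic = {pick_path("m", "t", 1000, 33434 + i, "UDP", paths)
           for i in range(8)}

# Paris traceroute keeps the flow ID constant, so every probe
# follows the same path:
paris = {pick_path("m", "t", 1000, 33434, "UDP", paths)
         for _ in range(8)}
# len(paris) == 1, while classic may contain both paths
```

The same flow ID always hashes to the same path, which is exactly the property Paris traceroute exploits by moving its probe identifier out of the flow-ID fields.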
More traceroute shortcomings
• Inferred nodes = interfaces, not routers
  – Different interfaces have different IP addresses
• Coverage depends on monitors and targets
  – Misses links and routers
  – Some links and routers appear multiple times
[Figure: actual topology with routers A, B, C, D between monitors m1, m2 and targets t1, t2, versus the inferred topology in which interfaces A.1, B.3, C.1, C.2, and D.1 appear as separate nodes, with some links missed and others duplicated.]
Alias resolution: map interfaces to routers
• Direct probing
  – The IP identifier (IPID) in the IP header is usually an increasing per-packet (or per-jiffy) counter
  – Responses from the same router have close IPIDs and the same TTL
• Record-route IP option
  – Records only up to nine IP addresses of routers in the path
    • Enough in many cases
  – Some routers may drop packets with IP options
    • Usually security concerns
  – Can also discover outgoing interfaces
[Figure: in the inferred topology, alias resolution reveals that interfaces C.1 and C.2 belong to the same router.]
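The IPID heuristic can be sketched as a simple test (an illustration of the idea only, not the exact algorithm of any alias-resolution tool; thresholds and sample values are ours): probe two candidate addresses back-to-back and declare them aliases if the replies share a TTL and have nearly consecutive IPIDs.

```python
def likely_aliases(resp_a, resp_b, max_gap=10):
    """resp_a, resp_b: (ipid, ttl) from replies to back-to-back probes
    of two candidate interfaces. A shared, increasing IPID counter and
    an identical reply TTL suggest both addresses sit on one router."""
    ipid_a, ttl_a = resp_a
    ipid_b, ttl_b = resp_b
    # Compare IPIDs modulo 2^16, since the 16-bit counter wraps around
    gap = (ipid_b - ipid_a) % 65536
    return ttl_a == ttl_b and 0 < gap <= max_gap

# Close IPIDs and equal TTLs: probably the same router
print(likely_aliases((4211, 60), (4213, 60)))   # → True
print(likely_aliases((4211, 60), (9999, 55)))   # → False
```

Real tools combine this with repeated probing, since routers with per-interface counters or randomized IPIDs would otherwise produce false positives.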
Large-scale topology measurements
• Probing a large topology takes time
  – E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
  – Probing more targets covers more links
  – But getting a topology snapshot takes longer
• A snapshot may be inaccurate
  – Paths may change during the snapshot
  – To know that a path changed, you need to re-probe
Large-scale topology measurements
• It is possible to reduce redundant probing – Topologies have tree like structures with aggregation points – Can skip redundant segments that are already discovered
B. Donnet et al.: “Efficient Algorithms for Large-Scale Topology Discovery”. SIGMETRICS 2005.
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Bandwidth measurements
• What?
  – Infer the bandwidth of a specific hop or of a whole path
  – Capacity = maximum possible throughput
  – Available bandwidth = portion of capacity not currently used
  – Bulk transfer capacity = throughput that a new single long-lived TCP connection could obtain
• Why?
  – Network-aware applications
    • Server or peer selection
    • Route selection in overlay networks
  – QoS measurements
Challenges
• Routers and switches do not provide direct feedback to end-hosts
  – Except ICMP (traceroute)
  – Mostly for scalability, policy, and simplicity reasons
• End-to-end bandwidth cannot be measured with SNMP
  – No access because of administrative barriers
  – Network administrators can query router/switch information only within their own network
The Internet as a “black box”
• End-systems can infer network state through end-to-end (e2e) measurements
  – Without any explicit feedback from routers
  – Objectives: accuracy, speed, minimal intrusiveness
[Figure: probing packets are sent end-to-end across the Internet, which is treated as a black box.]
Metrics and definitions
• Simple example of an end-to-end path
• 45
router1
cross traffic
link1 (access link) router2
cross traffic
link2
source host
destination host
link3 (access link)
Metrics and definitions (cont.)
• Capacity of this path is 100 Mbps – Determined by the narrow link
• Available bandwidth of this path is 50 Mbps – Determined by the tight link
[Figure: link1 is the narrow link and link3 is the tight link]

          link capacity   available bandwidth   used bandwidth
  link1   100 Mbps        90 Mbps               10 Mbps
  link2   2500 Mbps       1300 Mbps             1200 Mbps
  link3   1000 Mbps       50 Mbps               950 Mbps
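The narrow-link and tight-link definitions translate directly into code; a minimal sketch using the example numbers above:

```python
def path_capacity(capacities):
    """Path capacity = capacity of the narrow link (minimum capacity)."""
    return min(capacities)

def path_avail_bw(avail):
    """Path available bandwidth = available bandwidth of the tight link
    (minimum available bandwidth)."""
    return min(avail)

# link1, link2, link3 from the example (Mbps)
capacities = [100, 2500, 1000]
available  = [90, 1300, 50]

print(path_capacity(capacities))  # → 100  (narrow link = link1)
print(path_avail_bw(available))   # → 50   (tight link = link3)
```

Note that the narrow link and the tight link need not be the same link, which is why capacity and available-bandwidth estimation are separate problems.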
Measurement techniques
• Generally use active probing
  – Send packets with a specific inter-arrival pattern
  – Observe the pattern at the other end
• Example: packet-pair technique for capacity estimation
  – Send two equal-sized packets back-to-back
    • Packet size: L
    • Packet transmission time at link i: L/C_i
  – P-P dispersion: time interval between the first bits of the two packets
    • At each link: Δ_out = max(Δ_in, L/C_i)
  – Without any cross traffic, the dispersion at the receiver is determined by the narrow link:
    Δ_R = max_{i=1..H} (L/C_i) = L/C, where C = path capacity
[Figure: an incoming packet pair with spacing Δ_in leaves link i with spacing Δ_out]
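The dispersion recursion Δ_out = max(Δ_in, L/C_i) can be simulated hop by hop (a sketch with no cross traffic; the link capacities are illustrative, in bits per second):

```python
def receiver_dispersion(pkt_bits, capacities, d_in=0.0):
    """Propagate the packet-pair dispersion through each link:
    delta_out = max(delta_in, L / C_i). With no cross traffic the
    receiver sees L / C, where C is the narrow-link capacity."""
    d = d_in
    for c in capacities:
        d = max(d, pkt_bits / c)
    return d

L = 1500 * 8                   # two back-to-back 1500-byte probes
caps = [100e6, 10e6, 50e6]     # link capacities (bit/s)

d = receiver_dispersion(L, caps)   # 12000 / 10e6 = 1.2 ms
cap_estimate = L / d               # ≈ 10 Mbit/s: the narrow link
```

The receiver then inverts the relation, Capacity = L / dispersion, exactly as on the “Ideal Packet Dispersion” slide.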
Bandwidth estimation with cross traffic
• Cross traffic packets can affect P-P dispersion – P-P expansion: capacity underestimation – P-P compression: capacity overestimation
• Noise in P-P distribution depends on cross traffic load
Ideal Packet Dispersion
• No cross-traffic
Capacity = (Packet Size) / (Dispersion)
Expansion of Dispersion
• Cross traffic (CT) serviced between the PP packets
• Second packet queues due to the cross traffic
  → Expansion of dispersion → Under-estimation of capacity
Compression of Dispersion
• First packet queues → Compressed dispersion → Over-estimation of capacity
CapProbe
• The CapProbe estimation tool takes cross traffic into account
• Observations:
  – First packet queues more than the second
    • Compression → over-estimation
  – Second packet queues more than the first
    • Expansion → under-estimation
  – Both are the result of probe packets experiencing queuing
    • The sum of the PP delays includes the queuing delay
• Filter out PP samples that do not have minimum queuing time
  – The dispersion of the PP sample with the minimum delay sum reflects capacity
Rohit Kapoor et al.: CapProbe: A Simple and Accurate Capacity Estimation Technique. SIGCOMM ‘04
CapProbe approach
• For each packet pair, CapProbe calculates delay sum: delay(packet_1) + delay(packet_2)
• A PP with the minimum delay sum points out the capacity
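CapProbe's filtering step can be sketched with synthetic samples (the delays and dispersions below are illustrative, and this is a simplification of the tool, not its actual implementation): among all packet pairs, pick the one whose delay sum is minimal and use its dispersion for the capacity estimate.

```python
def capprobe_estimate(samples, pkt_bits):
    """samples: list of (delay1, delay2, dispersion) per packet pair,
    all in seconds. The pair with the minimum delay sum suffered the
    least queuing, so its dispersion best reflects the narrow link."""
    best = min(samples, key=lambda s: s[0] + s[1])
    return pkt_bits / best[2]

L = 1500 * 8   # probe size in bits
samples = [
    (0.020, 0.023, 0.0030),   # expanded by cross traffic
    (0.021, 0.019, 0.0008),   # compressed by cross traffic
    (0.015, 0.015, 0.0012),   # minimum delay sum: (almost) no queuing
]
est = capprobe_estimate(samples, L)   # ≈ 10 Mbit/s
```

The compressed sample would overestimate and the expanded one underestimate; the minimum-delay-sum filter discards both and keeps the queue-free dispersion.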
Bandwidth estimation tools
• Many estimation tools & techniques
  – Abing, netest, pipechar, STAB, pathneck, IGI/PTR, abget, Spruce, pathchar, clink, pchar, PPrate, DSLprobe, ABwProbe, …
• Some practical issues
  – Traffic shapers
  – Non-FIFO queues
• More scalable methods
  – Passive measurements instead of active measurements
    • E.g. PPrate (2006) for capacity estimation: adapts Pathrate’s algorithm
  – One measurement host instead of two cooperating ones
    • abget (2006) for available bandwidth estimation
    • DSLprobe for capacity estimation of asymmetric (ADSL) links