T-110.5116 Computer Networks II Network diagnostics and traffic analysis 12/19.11.2012 Matti Siekkinen
(Sources: R.Teixeira: Internet measurements: fault detection, identification, and topology discovery; S. Kandula: Detailed Diagnosis in Enterprise Networks)
Concerning exam dates
• Network security exam on the same date: December 17 – Different time though…
• Additional exam date now available: January 3, 2013
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
What is Quality of Service?
• Many applications are sensitive to delay, jitter, and packet loss – Too-high values make the utility drop to zero
• Some mission-critical applications cannot tolerate disruption – VoIP – high-availability computing
• Related concept is service availability – How likely is it that I can place a call and not get interrupted? – requires meeting the QoS requirements for the given application
Example QoS Requirements
[Figure: example applications plotted by delay sensitivity (sensitive ↔ insensitive) against mission criticality (casual ↔ critical). Examples include personal voice over IP, network monitoring, CEO video conference with analysis, financial transactions, interactive whiteboard, unicast radio, network management traffic, extranet web traffic, public web traffic, push news, personal e-mail, business e-mail, and server backups.]
How to guarantee QoS
• Provisioning (before any data packets sent)
  – Admission control
    • Prohibit or allow new flows to enter the network
    • Make sure we have the necessary available bandwidth in the network
  – Resource reservation
    • Reserve the necessary available bandwidth in the network
• Control (during data transfer)
  – Scheduling (FIFO, WFQ)
    • Which flow gets a piece of the resources at a given time instant
  – Queue mgmt (drop-tail, RED)
    • If the buffer fills up, which flow do we punish?
  – Policing (leaky/token bucket)
    • Enforce flows to behave according to the agreed policy
    • E.g. send traffic at constant rate R
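The token-bucket policing idea above can be sketched in a few lines of Python (a simplified illustration, not any particular router's implementation; the class name and parameters are ours): tokens accumulate at rate R up to a burst size, and a packet conforms only if enough tokens are available.

```python
class TokenBucket:
    """Token-bucket policer: admits bursts up to `burst` bytes
    and a sustained rate of `rate` bytes per second."""

    def __init__(self, rate, burst):
        self.rate = rate      # token fill rate (bytes/s)
        self.burst = burst    # bucket depth (bytes)
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of last update (s)

    def conforms(self, now, pkt_len):
        """Return True if a packet of pkt_len bytes arriving at time
        `now` conforms to the policy (consuming tokens if it does)."""
        # Refill tokens for the elapsed time, capped at the bucket depth
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_len:
            self.tokens -= pkt_len
            return True
        return False   # non-conforming: drop or mark the packet
```

With rate 1000 B/s and burst 1500 B, the bucket admits an initial 1500-byte burst and thereafter roughly 1000 bytes per second.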
How to guarantee QoS (cont.)
[Figure: QoS mechanisms placed at the network edge and core — packet scheduling (users get their share of bandwidth), traffic shaping (policing to control the amount of traffic users can inject into the network), and admission control (to accept or reject a flow based on flow specifications).]
How to guarantee QoS (cont.)
• These are network-layer techniques – Each router needs to support this – Together allow perfect control of QoS
• The Internet does not implement these mechanisms – Works today only within a single ISP's network
• Technically we know how to do it Internet wide but other reasons prevent deployment – E.g. lack of business models
What if we had QoS guarantees?
• Your Internet subscription states the SLA
  – Describes what kind of service you will get
  – E.g. guaranteed bandwidth of B with max delay of D when there are no higher-priority customers present
• How would you perceive that?
  – YouTube video would either stream perfectly or might not load at all
    • May be no admission with your SLA at the moment
  – Skype call would never be of bad quality, but the call can be refused or interrupted
  – Downloading a file of size S takes exactly S/B seconds
  – Obviously, assuming the network is not broken and you have (wireless) coverage…
QoS in Today’s Internet
• TCP/UDP/IP: “best-effort service”
  – No guarantees on delay or loss
• Today’s Internet applications use application-level techniques to mitigate (as best possible) the effects of delay and loss
• But some apps (multimedia) require QoS and a level of performance to be effective!
What can be done today to control QoS
• Mainly application-level techniques
  – Application adapts to network conditions
    • Buffer stream data, conceal errors, …
  – Use overlay networks
  – No need to change anything in routers
• Make the best out of the best-effort network
  – Cannot guarantee anything
• No guarantees means we cannot be sure what kind of QoS we get
  – Monitoring is important
  – Enter network diagnostics and traffic analysis…
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Network diagnostics and traffic analysis
• Understand how the network is doing – Detect and diagnose faults (links, routers, …) – Identify performance bottlenecks
• E.g. congested link
• Detect and quarantine misbehaving devices or traffic – Anomaly detection – E.g. misconfigured router, attacks
• Learn what kind of QoS users perceive – Performance evaluation of applications – Analyze resulting traffic to infer perceived QoS – Goal is obviously to improve if possible
Why bother?
• Keep things going – Stuff breaks down – Operators and admins are human beings and make mistakes – Want to keep the networks operational
• Maximum benefit out of the infrastructure – Equipment costs money – Maximize utilization
• Happy customers – Performance troubles make them unhappy – Unhappy customers may decrease revenues
Why is it challenging?
• Few built-in diagnosis mechanisms
  – Today’s networks run on IP
  – Network elements are “simple”
  – Intelligence lies at the edges
  ⇒ May need to use complex end-to-end methods to measure simple things (e.g. link capacity)
• Scale can be very large – Traffic volumes – Number of nodes – Different services and protocols ⇒ Diagnosis techniques need to be scalable too
Diagnosing networks
• Obtain some input data
  – SNMP traps, syslog messages, trouble tickets, traffic traces, etc.
• Inference / analysis
  – Analyze the input data
  – E.g. learn that a router link is down
• Do something about it
  – E.g. start fixing the link
[Figure: Collect raw measurements → Analyze measurements → Use learned information]
Ways to collect data for diagnosis
• Management tools – Ask the devices how they are doing – Receive alarms, traps – E.g. SNMP
• Passive measurements – Simply record what you observe – E.g. Cisco’s Netflow traffic data or raw traffic header traces
• Active measurements – Send probes and observe what happens to them – E.g. tomography, bandwidth measurements
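The passive-measurement style of Cisco's NetFlow can be illustrated with a toy aggregator (a sketch only; real NetFlow records export many more fields, and the packet tuples below are made up): observed packets are grouped by their 5-tuple into flow records with packet and byte counts.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets by 5-tuple (src IP, dst IP, src port, dst port,
    protocol) into NetFlow-style records: [packet count, byte count]."""
    flows = defaultdict(lambda: [0, 0])
    for src, dst, sport, dport, proto, length in packets:
        rec = flows[(src, dst, sport, dport, proto)]
        rec[0] += 1        # packets in this flow
        rec[1] += length   # bytes in this flow
    return dict(flows)

# Hypothetical observed packets: (src, dst, sport, dport, proto, bytes)
packets = [
    ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 1500),
    ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", 40),
    ("10.0.0.3", "10.0.0.2", 5353, 53, "UDP", 80),
]
flows = aggregate_flows(packets)
# Two flows: the first with 2 packets / 1540 bytes, the second with 1 / 80
```

This kind of aggregation is exactly the data reduction that makes passive monitoring scale: the monitor stores per-flow counters instead of every packet.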
Where to collect measurements?
• Network aggregation points
  – Router, switch
  – Access, gateway, backbone
  – Depends on scale, available methods, and objectives
• Client or server – Usually limited possibilities – Possible in data center networks
[Figure: ISP network with access routers serving customers, backbone routers, and a gateway router connecting to other ISPs (ISP 1, ISP 2, ISP 3).]
Analyzing data
• On-line
  – Perform (at least a part of) the analysis on the observed data in a real-time manner
  ☺ Data reduction → don’t store everything
  ☺ Can react quickly
  ☹ Scalability
    • A 10 Gbit/s link produces >8 MB/s of uncompressed packet headers
    • May need sampling, aggregation
  ☹ Do not necessarily have all the raw data for later analysis
• Off-line
  – Record data into persistent storage and analyze later
  ☺ Run complex, time-consuming analysis
  ☹ Not for time-critical analysis
  ☹ Storage issues
Analyzing data (cont.)
• Human vs. machine
  – Statistical analysis and data mining techniques
  – Reveal non-trivial patterns (aggregate/similar behavior, anomalies)
  – Still need an admin/operator somewhere in the loop
• Combine many data sources
  – Increase robustness
    • Fewer false positives
  – Detect issues that would normally “fly under the radar”
    • Aggregated input feeds may reveal more
Analyzing data (cont.)
• Wait a minute, we already have SNMP!
  – E.g. routers can produce traps when something goes wrong
• Alarms and traps from devices are not enough even for one network
  – Network “black holes”
    • Silent failures: network devices do not send alarms
    • Causes: complex cross-layer interactions, router software bugs/misconfigurations, …
  – Need detailed and application-specific diagnosis
    • Want to know the causes of failures/problems that raise alarms
  – End-to-end diagnosis across administrative domains
    • You cannot make an SNMP query to a router in an Australian ISP’s network
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Measuring networks
• Measurements and diagnosis of network properties – Bandwidth, delay, connectivity, reachability…
• How?
  – Active measurements
    • Probing messages analyzed at the other end
    • Clever use of standard protocols: ping, traceroute
  – Passively collected data (e.g. routing logs)
• Three example cases – Topology discovery – Bandwidth measurements – Network tomography
Topology
• What’s topology?
  – Topology describes how the network is laid out
    • Links between routers, switches, etc.
    • Not trivial knowledge in large-scale networks
• What’s the Internet topology like?
  – The Internet consists of Autonomous Systems (ASes)
    • “a connected group of one or more IP prefixes run by one or more network operators which has a single and clearly defined routing policy” [RFC 1930]
    • E.g. an Internet Service Provider (ISP)
  – The Internet has a two-level topology
    • Intra-domain topology: within a single network (AS)
    • Inter-domain topology: across ASes
Internet topology: illustration
[Figure]
Internet topology (cont.)
• Internet service providers (ISPs) grouped in classes
  – Tier 1: global
    • 10–15 of them
    • The Internet’s “backbone”
    • Settlement-free peering: carry each other’s traffic without charges
  – Tier 2: regional
    • Both peering and transit services
  – Tier 3: local
    • Solely transit (buy connectivity from higher-tier ISPs)
Internet topology in 2008
[Figure: AS-level topology map. A few tens of thousands of ASes; size varies.]
Topology discovery
• Find out the topology of a given network by
  – probing (active measurements)
  – analyzing logs and/or traffic (passive measurements)
• Why is it useful?
  – Some diagnosis methods rely on accurate topology information
    • E.g. network tomography needs topology
  – Realistic simulation and modeling of the Internet
    • Topology models needed for simulations
    • E.g. performance of routing protocols is critically dependent on topology
Topology discovery (cont.)
• Granularity level
  – Router-level topologies
    • Reflect physical connectivity between nodes
    • Inferred using e.g. traceroute
  – AS graphs
    • Peering relationships between providers/clients
    • Inferred from inter-domain routers’ BGP tables
    • Could also use traceroute with some additional information
• Measurement location
  – With access to routers (“from inside”)
    • Topology of one network
    • Routing monitors (OSPF or IS-IS)
  – No access to routers (“from outside”)
    • Multi-AS topology or from end-hosts
    • Monitors issue active probes: traceroute
Topology from inside
• Routing protocols flood state of each link – Periodically refresh link state – Report any changes: link down, up, cost change
• Monitor listens to link-state messages – Acts as a regular router
• AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
• Combining link states gives the topology – Easy to maintain, messages report any changes
• Usually not possible across domains
Inferring a path from outside: traceroute
[Figure: monitor m probes target t through routers A and B. The TTL=1 probe triggers “TTL exceeded” from interface A.1; the TTL=2 probe triggers “TTL exceeded” from B.1. The actual path traverses interfaces A.1/A.2 and B.2/B.1; the inferred path is A.1 → B.1.]
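The TTL mechanism behind traceroute can be illustrated with a small simulation in Python (a sketch: no real probes are sent, and the interface names match the figure above): each router decrements the TTL, and the router at which it reaches zero reports its incoming interface.

```python
def simulated_traceroute(path):
    """Infer a path the way traceroute does: for increasing TTL values,
    record the interface of the router at which the TTL expires.
    `path` lists the router interfaces between monitor and target."""
    inferred = []
    for ttl in range(1, len(path) + 1):
        hops_left = ttl
        for hop in path:
            hops_left -= 1          # each router decrements the TTL
            if hops_left == 0:      # TTL exceeded: this router replies
                inferred.append(hop)
                break
    return inferred

# Probing through interfaces A.1 and B.1, as in the figure:
print(simulated_traceroute(["A.1", "B.1"]))  # → ['A.1', 'B.1']
```

Note that each TTL value is a separate probe: the monitor reconstructs the whole path only by sending one probe per hop, which is why path changes and load balancing (next slides) can confuse it.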
A traceroute path can be incomplete
• Load balancing is widely used
  – Forward packets differently based on load in different parts of the network
  – Can be per-flow or even per-packet
  – Traceroute only probes one path
• Sometimes traceroute has no answer (stars)
  – ICMP rate limiting for DoS protection
  – Anonymous routers
    • Do not send ICMP replies at all, or reply with the probe’s destination IP
    • Security and privacy concerns
• Tunnelling (e.g., MPLS) may hide routers
  – Routers inside the tunnel may not decrement TTL
Traceroute under load balancing
[Figure: monitor m probes target t through load balancer L, which spreads successive probes (TTL=2, TTL=3) over two parallel paths via routers A, B, C, D, E. The inferred path therefore has missing nodes and links and contains a false link between routers that are not directly connected.]
Traceroute under load balancing (cont.)
• Even per-flow load balancing causes trouble
  – Traceroute uses the destination port as probe identifier
    • Needs to match each probe to its response
    • The response only contains the header of the issued probe
  – Varying the port changes the flow ID, so successive probes may be balanced onto different paths
[Figure: classic traceroute sends the TTL=2 probe with destination port 2 and the TTL=3 probe with port 3; the load balancer L hashes the differing flow IDs onto different paths toward target t.]
Paris traceroute
• Solves the problem with per-flow load balancing
  – Probes to a destination belong to the same flow
    • Keep the flow ID constant for probes to a specific destination
    • Flow ID = src/dst IP & port, transport protocol
• How to match probes with ICMP responses?
  – Need to know which ICMP response corresponds to which probe
Paris traceroute
• Matching probes with ICMP responses
  – Vary fields within the first eight octets of the transport-layer header (included in the ICMP response)
  – Keep the flow-ID-related fields constant
  – UDP probes: vary the checksum (need to manipulate the payload too)
  – ICMP probes: vary the sequence number and Identifier, but keep the checksum constant
[Figure: Paris traceroute keeps port 1 (the flow ID) constant for all probes and instead varies the checksum (2, 3, …) to match responses; the load balancer L keeps all probes on the same path.]
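The difference between classic and Paris traceroute under per-flow load balancing can be sketched with a toy balancer (illustrative only: Python's `hash()` stands in for the router's hash function, and the paths and port numbers are invented):

```python
def pick_path(src, dst, sport, dport, proto, paths):
    """Per-flow load balancer: hash the flow ID (5-tuple) to pick a path."""
    flow_id = (src, dst, sport, dport, proto)
    return paths[hash(flow_id) % len(paths)]

paths = [("L", "A", "B"), ("L", "C", "D")]

# Classic traceroute varies the destination port per probe,
# so probes may be hashed onto different paths:
classic = {pick_path("m", "t", 1000, 33434 + i, "UDP", paths)
           for i in range(8)}

# Paris traceroute keeps the flow ID constant, so every probe
# follows the same path:
paris = {pick_path("m", "t", 1000, 33434, "UDP", paths)
         for _ in range(8)}
# len(paris) == 1, while classic may contain both paths
```

The same flow ID always hashes to the same path, which is exactly the property Paris traceroute exploits by moving its probe identifier out of the flow-ID fields.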
More traceroute shortcomings
• Inferred nodes = interfaces, not routers
  – Different interfaces have different IP addresses
• Coverage depends on monitors and targets
  – Misses links and routers
  – Some links and routers appear multiple times
[Figure: actual topology with routers A, B, C, D between monitors m1, m2 and targets t1, t2, versus the inferred topology in which interfaces A.1, B.3, C.1, C.2, and D.1 appear as separate nodes, with some links missed and others duplicated.]
Alias resolution: map interfaces to routers
• Direct probing
  – The IP identifier (IPID) in the IP header is usually an increasing per-packet (or per-jiffy) counter
  – Responses from the same router have close IPIDs and the same TTL
• Record-route IP option
  – Records only up to nine IP addresses of routers in the path
    • Enough in many cases
  – Some routers may drop packets with IP options
    • Usually security concerns
  – Can also discover outgoing interfaces
[Figure: in the inferred topology, alias resolution reveals that interfaces C.1 and C.2 belong to the same router.]
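The IPID heuristic can be sketched as a simple test (an illustration of the idea only, not the exact algorithm of any alias-resolution tool; thresholds and sample values are ours): probe two candidate addresses back-to-back and declare them aliases if the replies share a TTL and have nearly consecutive IPIDs.

```python
def likely_aliases(resp_a, resp_b, max_gap=10):
    """resp_a, resp_b: (ipid, ttl) from replies to back-to-back probes
    of two candidate interfaces. A shared, increasing IPID counter and
    an identical reply TTL suggest both addresses sit on one router."""
    ipid_a, ttl_a = resp_a
    ipid_b, ttl_b = resp_b
    # Compare IPIDs modulo 2^16, since the 16-bit counter wraps around
    gap = (ipid_b - ipid_a) % 65536
    return ttl_a == ttl_b and 0 < gap <= max_gap

# Close IPIDs and equal TTLs: probably the same router
print(likely_aliases((4211, 60), (4213, 60)))   # → True
print(likely_aliases((4211, 60), (9999, 55)))   # → False
```

Real tools combine this with repeated probing, since routers with per-interface counters or randomized IPIDs would otherwise produce false positives.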
Large-scale topology measurements
• Probing a large topology takes time
  – E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
  – Probing more targets covers more links
  – But getting a topology snapshot takes longer
• A snapshot may be inaccurate
  – Paths may change during the snapshot
  – To know that a path changed, you need to re-probe
Large-scale topology measurements
• It is possible to reduce redundant probing – Topologies have tree like structures with aggregation points – Can skip redundant segments that are already discovered
B. Donnet et al.: “Efficient Algorithms for Large-Scale Topology Discovery”. SIGMETRICS 2005.
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Bandwidth measurements
• What?
  – Infer the bandwidth of a specific hop or of a whole path
  – Capacity = maximum possible throughput
  – Available bandwidth = portion of capacity not currently used
  – Bulk transfer capacity = throughput that a new single long-lived TCP connection could obtain
• Why?
  – Network-aware applications
    • Server or peer selection
    • Route selection in overlay networks
  – QoS measurements
Challenges
• Routers and switches do not provide direct feedback to end-hosts
  – Except ICMP (traceroute)
  – Mostly for scalability, policy, and simplicity reasons
• End-to-end bandwidth cannot be measured with SNMP
  – No access because of administrative barriers
  – Network administrators can query router/switch information only within their own network
The Internet as a “black box”
• End-systems can infer network state through end-to-end (e2e) measurements
  – Without any explicit feedback from routers
  – Objectives: accuracy, speed, minimal intrusiveness
[Figure: probing packets are sent end-to-end across the Internet, which is treated as a black box.]
Metrics and definitions
• Simple example of an end-to-end path
• 45
router1
cross traffic
link1 (access link) router2
cross traffic
link2
source host
destination host
link3 (access link)
Metrics and definitions (cont.)
• Capacity of this path is 100 Mbps – Determined by the narrow link
• Available bandwidth of this path is 50 Mbps – Determined by the tight link
[Figure: link1 is the narrow link and link3 is the tight link]

          link capacity   available bandwidth   used bandwidth
  link1   100 Mbps        90 Mbps               10 Mbps
  link2   2500 Mbps       1300 Mbps             1200 Mbps
  link3   1000 Mbps       50 Mbps               950 Mbps
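The narrow-link and tight-link definitions translate directly into code; a minimal sketch using the example numbers above:

```python
def path_capacity(capacities):
    """Path capacity = capacity of the narrow link (minimum capacity)."""
    return min(capacities)

def path_avail_bw(avail):
    """Path available bandwidth = available bandwidth of the tight link
    (minimum available bandwidth)."""
    return min(avail)

# link1, link2, link3 from the example (Mbps)
capacities = [100, 2500, 1000]
available  = [90, 1300, 50]

print(path_capacity(capacities))  # → 100  (narrow link = link1)
print(path_avail_bw(available))   # → 50   (tight link = link3)
```

Note that the narrow link and the tight link need not be the same link, which is why capacity and available-bandwidth estimation are separate problems.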
Measurement techniques
• Generally use active probing
  – Send packets with a specific inter-arrival pattern
  – Observe the pattern at the other end
• Example: packet-pair technique for capacity estimation
  – Send two equal-sized packets back-to-back
    • Packet size: L
    • Packet transmission time at link i: L/C_i
  – P-P dispersion: time interval between the first bits of the two packets
    • At each link: Δ_out = max(Δ_in, L/C_i)
  – Without any cross traffic, the dispersion at the receiver is determined by the narrow link:
    Δ_R = max_{i=1..H} (L/C_i) = L/C, where C = path capacity
[Figure: an incoming packet pair with spacing Δ_in leaves link i with spacing Δ_out]
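The dispersion recursion Δ_out = max(Δ_in, L/C_i) can be simulated hop by hop (a sketch with no cross traffic; the link capacities are illustrative, in bits per second):

```python
def receiver_dispersion(pkt_bits, capacities, d_in=0.0):
    """Propagate the packet-pair dispersion through each link:
    delta_out = max(delta_in, L / C_i). With no cross traffic the
    receiver sees L / C, where C is the narrow-link capacity."""
    d = d_in
    for c in capacities:
        d = max(d, pkt_bits / c)
    return d

L = 1500 * 8                   # two back-to-back 1500-byte probes
caps = [100e6, 10e6, 50e6]     # link capacities (bit/s)

d = receiver_dispersion(L, caps)   # 12000 / 10e6 = 1.2 ms
cap_estimate = L / d               # ≈ 10 Mbit/s: the narrow link
```

The receiver then inverts the relation, Capacity = L / dispersion, exactly as on the “Ideal Packet Dispersion” slide.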
Bandwidth estimation with cross traffic
• Cross traffic packets can affect P-P dispersion – P-P expansion: capacity underestimation – P-P compression: capacity overestimation
• Noise in P-P distribution depends on cross traffic load
Ideal Packet Dispersion
• No cross-traffic
Capacity = (Packet Size) / (Dispersion)
Expansion of Dispersion
• Cross traffic (CT) serviced between the PP packets
• Second packet queues due to the cross traffic
  → Expansion of dispersion → Under-estimation of capacity
Compression of Dispersion
• First packet queues → Compressed dispersion → Over-estimation of capacity
CapProbe
• The CapProbe estimation tool takes cross traffic into account
• Observations:
  – First packet queues more than the second
    • Compression → over-estimation
  – Second packet queues more than the first
    • Expansion → under-estimation
  – Both are the result of probe packets experiencing queuing
    • The sum of the PP delays includes the queuing delay
• Filter out PP samples that do not have minimum queuing time
  – The dispersion of the PP sample with the minimum delay sum reflects capacity
Rohit Kapoor et al.: CapProbe: A Simple and Accurate Capacity Estimation Technique. SIGCOMM ‘04
CapProbe approach
• For each packet pair, CapProbe calculates delay sum: delay(packet_1) + delay(packet_2)
• A PP with the minimum delay sum points out the capacity
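CapProbe's filtering step can be sketched with synthetic samples (the delays and dispersions below are illustrative, and this is a simplification of the tool, not its actual implementation): among all packet pairs, pick the one whose delay sum is minimal and use its dispersion for the capacity estimate.

```python
def capprobe_estimate(samples, pkt_bits):
    """samples: list of (delay1, delay2, dispersion) per packet pair,
    all in seconds. The pair with the minimum delay sum suffered the
    least queuing, so its dispersion best reflects the narrow link."""
    best = min(samples, key=lambda s: s[0] + s[1])
    return pkt_bits / best[2]

L = 1500 * 8   # probe size in bits
samples = [
    (0.020, 0.023, 0.0030),   # expanded by cross traffic
    (0.021, 0.019, 0.0008),   # compressed by cross traffic
    (0.015, 0.015, 0.0012),   # minimum delay sum: (almost) no queuing
]
est = capprobe_estimate(samples, L)   # ≈ 10 Mbit/s
```

The compressed sample would overestimate and the expanded one underestimate; the minimum-delay-sum filter discards both and keeps the queue-free dispersion.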
Bandwidth estimation tools
• Many estimation tools & techniques
  – Abing, netest, pipechar, STAB, pathneck, IGI/PTR, abget, Spruce, pathchar, clink, pchar, PPrate, DSLprobe, ABwProbe, …
• Some practical issues
  – Traffic shapers
  – Non-FIFO queues
• More scalable methods
  – Passive measurements instead of active measurements
    • E.g. PPrate (2006) for capacity estimation: adapts Pathrate’s algorithm
  – One measurement host instead of two cooperating ones
    • abget (2006) for available bandwidth estimation
    • DSLprobe for capacity estimation of asymmetric (ADSL) links