t-110.5116 computer networks ii - aalto university
TRANSCRIPT
• 23.9.2010
T-110.5116 Computer Networks II Network diagnostics and traffic analysis 12/19.11.2012 Matti Siekkinen
(Sources: R.Teixeira: Internet measurements: fault detection, identification, and topology discovery; S. Kandula: Detailed Diagnosis in Enterprise Networks)
Concerning exam dates
• Network security exam on the same date: December 17 – Different time though…
• Now an additional exam date: January 3, 2013
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
What is Quality of Service?
• Many applications are sensitive to delay, jitter, and packet loss – Too high values make utility drop to zero
• Some mission-critical applications cannot tolerate disruption – VoIP – high-availability computing
• Related concept is service availability – How likely is it that I can place a call and not get interrupted? – requires meeting the QoS requirements for the given application
Example QoS Requirements
[Figure: example applications plotted by delay sensitivity (sensitive ↔ insensitive) and mission criticality (casual ↔ critical): personal voice over IP, network monitoring, CEO video conference with analysis, financial transactions, interactive whiteboard, unicast radio, network management traffic, extranet web traffic, public web traffic, push news, personal e-mail, business e-mail, server backups]
How to guarantee QoS
• Provisioning (before any data packets sent) – Admission control
• Prohibit or allow new flows to enter the nw • Make sure we have necessary available bandwidth in network
– Resource reservation • Reserve the necessary available bandwidth in network
• Control (during data transfer) – Scheduling (FIFO, WFQ)
• Which flow gets a piece of resources at a given time instant – Queue mgmt (drop-tail, RED)
• If buffer fills up, which flow do we punish? – Policing (leaky/token bucket)
• Enforce flows to behave according to agreed policy • E.g. send traffic at constant rate R
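The token bucket mentioned above can be sketched in a few lines. This is an illustrative model only, not any particular router's implementation; class name and parameter values are made up:

```python
# Token-bucket policer sketch: tokens accrue at rate R (bytes/s) up to
# bucket depth B; a packet conforms if enough tokens are available.
class TokenBucket:
    def __init__(self, rate, depth):
        self.rate = rate          # token fill rate, bytes per second
        self.depth = depth        # bucket capacity, bytes (burst size)
        self.tokens = depth       # start with a full bucket
        self.last = 0.0           # time of last update, seconds

    def conforms(self, now, pkt_len):
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if pkt_len <= self.tokens:
            self.tokens -= pkt_len
            return True           # packet conforms: forward it
        return False              # non-conformant: drop or mark it

bucket = TokenBucket(rate=1000, depth=1500)   # 1 kB/s, one MTU of burst
print(bucket.conforms(0.0, 1500))  # burst allowed: True
print(bucket.conforms(0.1, 1500))  # only ~100 tokens accrued: False
```

With depth = one MTU this enforces roughly constant-rate sending ("send traffic at constant rate R"); a deeper bucket tolerates larger bursts.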
How to guarantee QoS (cont.)
[Figure: where the mechanisms apply between edge and core:
• Packet scheduling (users get their share of bandwidth)
• Traffic shaping (policing to control the amount of traffic users can inject into the network)
• Admission control (to accept or reject a flow based on flow specifications)]
How to guarantee QoS (cont.)
• These are network-layer techniques – Each router needs to support this – Together allow perfect control of QoS
• Internet does not implement these mechanisms – Works today only within some ISPs' networks
• Technically we know how to do it Internet-wide, but other reasons prevent deployment – E.g. lack of business models
What if we had QoS guarantees?
• Your Internet subscription states the SLA – Describes what kind of service you will get – E.g. guaranteed bandwidth of B with max delay of D when there are no higher-priority customers present
• How would you perceive that? – YouTube video would either stream perfectly or might not load at all
• May have no admission with your SLA at the moment
– Skype call would never be of bad quality, but the call can be refused or interrupted
– Downloading a file of size S takes exactly S/B seconds
– Obviously, assuming the network is not broken and you have (wireless) coverage…
QoS in Today’s Internet
TCP/UDP/IP: “best-effort service” • no guarantees on delay, loss
Today’s Internet applications use application-level techniques to mitigate
(as best possible) effects of delay, loss
But some apps (multimedia) require QoS and level of performance to be
effective!
What can be done today to control QoS
• Mainly application-level techniques – Application adapts to network conditions
• Buffer stream data, conceal errors, … – Use overlay networks – No need to change anything in routers
• Make the best out of the best effort network – Cannot guarantee anything
• No guarantees means we cannot be sure what kind of QoS we get – Monitoring is important – Enter network diagnostics and traffic analysis…
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Network diagnostics and traffic analysis
• Understand how the network is doing – Detect and diagnose faults (links, routers, …) – Identify performance bottlenecks
• E.g. congested link
• Detect and quarantine misbehaving devices or traffic – Anomaly detection – E.g. misconfigured router, attacks
• Learn what kind of QoS users perceive – Performance evaluation of applications – Analyze resulting traffic to infer perceived QoS – Goal is obviously to improve if possible
Why bother?
• Keep things going – Stuff breaks down – Operators and admins are human beings and make mistakes – Want to keep the networks operational
• Maximum benefit out of the infrastructure – Equipment costs money – Maximize utilization
• Happy customers – Performance troubles make them unhappy – Unhappy customers may decrease revenues
Why is it challenging?
• Few built-in diagnosis mechanisms – Today's networks run on IP – Network elements are "simple" – Intelligence lies at the edges ⇒ May need to use complex end-to-end methods to measure simple things (e.g. link capacity)
• Scale can be very large – Traffic volumes – Number of nodes – Different services and protocols ⇒ Diagnosis techniques need to be scalable too
Diagnosing networks
• Obtain some input data – SNMP traps, syslog msgs, trouble tickets, traffic traces etc.
• Inference / Analysis – Analyze the input data – E.g. learn that a router link is down
• Do something about it – E.g. start fixing the link
Collect raw measurements → Analyze measurements → Use learned information
Ways to collect data for diagnosis
• Management tools – Ask the devices how they are doing – Receive alarms, traps – E.g. SNMP
• Passive measurements – Simply record what you observe – E.g. Cisco’s Netflow traffic data or raw traffic header traces
• Active measurements – Send probes and observe what happens to them – E.g. tomography, bandwidth measurements
Where to collect measurements?
• Network aggregation points – Router, switch – Access, gateway, backbone – Depends on scale, available methods, and objectives
• Client or server – Usually limited possibilities – Possible in data center networks
[Figure: an ISP network with access routers connecting customers, backbone routers in the core, and a gateway router peering with ISP 1, ISP 2, and ISP 3]
Analyzing data
• On-line – Perform (at least a part of) the analysis on the observed data in a real-time manner
☺ Data reduction -> don't store everything
☺ Can react quickly
☹ Scalability
• 10 Gbit/s link produces >8 MB/s of uncompressed packet headers
• May need sampling, aggregation
☹ Do not necessarily have all the raw data for later analysis
• Off-line – Record data into persistent storage and analyze later
☺ Run complex time-consuming analysis
☹ Not for time critical analysis
☹ Storage issues
Analyzing data (cont.)
• Human vs. machine – Statistical analysis and data mining techniques – Reveal non-trivial patterns (aggregate/similar behavior, anomalies) – Still need an admin/operator somewhere in the loop
• Combine many data sources – Increase robustness
• Fewer false positives – Detect issues that would normally “fly under the radar”
• Aggregated input feeds may reveal more
Analyzing data (cont.)
• Wait a minute, we already have SNMP! – E.g. routers can produce traps when something goes wrong
• Alarms and traps from devices are not enough even for one network – Network "black holes"
• Silent failures: nw devices do not send alarms
• Causes: complex cross-layer interactions, router sw bugs/misconfigurations, …
– Need detailed and application-specific diagnosis
• Want to know the causes of failures/problems that raise alarms
• End-to-end diagnosis – Diagnosis across administrative domains – You cannot make an SNMP query to a router in an Australian ISP's network
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Measuring networks
• Measurements and diagnosis of network properties – Bandwidth, delay, connectivity, reachability…
• How? – Active measurements
• Probing messages analyzed at the other end • Clever use of standard protocols
– ping, traceroute – Passively collected data (e.g. routing logs)
• Three example cases – Topology discovery – Bandwidth measurements – Network tomography
Topology
• What’s topology? – Topology describes how the network is laid out
• Links between routers, switches, etc. • Not trivial knowledge in large scale networks
• What’s Internet topology like? – Internet consists of Autonomous Systems (AS)
• “a connected group of one or more IP prefixes run by one or more network operators which has a single and clearly defined routing policy” [RFC 1930]
• E.g. Internet Service Provider (ISP) – Internet has two-level topology
• Intra-domain topology – Within a single network (AS)
• Inter-domain topology – Across ASs
Internet topology (cont.)
• Internet service providers (ISP) grouped in classes
– tier 1: global • 10-15 of them • Internet's "backbone" • Settlement-free peering: carry each other's traffic without charges
– tier 2: regional • Both peering and transit services
– tier 3: local • Solely transit (buy connectivity from higher-tier ISPs)
Topology discovery
• Find out topology of a given network by – probing (active measurements) – analyzing logs and/or traffic (passive measurements)
• Why is it useful? – Some diagnosis methods rely on accurate topology information • E.g. Network tomography needs topology
– Realistic simulation and modeling of the Internet • Topology models needed for simulations • E.g. performance of routing protocols is critically dependent on topology
Topology discovery (cont.)
• Granularity level – Router-level topologies
• Reflect physical connectivity between nodes
• Inferred using e.g. traceroute
– AS graphs • Peering relationships between providers/clients • Inferred from inter-domain routers' BGP tables • Could also use traceroute with some additional information
• Measurement location – With access to routers ("from inside")
• Topology of one network
• Routing monitors (OSPF or IS-IS)
– No access to routers ("from outside") • Multi-AS topology or from end-hosts • Monitors issue active probes: traceroute
Topology from inside
• Routing protocols flood state of each link – Periodically refresh link state – Report any changes: link down, up, cost change
• Monitor listens to link-state messages – Acts as a regular router
• AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
• Combining link states gives the topology – Easy to maintain, messages report any changes
• Usually not possible across domains
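Combining link states into a topology can be illustrated with a toy monitor. The (router, neighbor, up/down) message format here is invented for illustration; real OSPF/IS-IS link-state advertisements carry much more:

```python
# Toy link-state monitor: each message reports one link and its state;
# applying all messages in order yields the current topology.
def apply_lsa(topology, router, neighbor, up):
    link = frozenset((router, neighbor))   # links are undirected
    if up:
        topology.add(link)        # link advertised as up (or refreshed)
    else:
        topology.discard(link)    # link reported down

topo = set()
apply_lsa(topo, "R1", "R2", up=True)
apply_lsa(topo, "R2", "R3", up=True)
apply_lsa(topo, "R1", "R2", up=False)   # link R1-R2 goes down
print(sorted(tuple(sorted(l)) for l in topo))  # [('R2', 'R3')]
```

Because every change is reported, the monitor's view stays current without re-probing, which is exactly why this is easy to maintain (within one domain).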
• 30
Inferring a path from outside: traceroute
[Figure: monitor m probes target t across routers A and B. A probe with TTL = 1 elicits "TTL exceeded" from interface A.1, and TTL = 2 from B.1. Actual path: m – A – B – t; inferred path: A.1, B.1]
A traceroute path can be incomplete
• Load balancing is widely used – Forward packets differently based on load in different parts of the network – Can be per-flow or even per-packet – Traceroute only probes one path
• Sometimes traceroute has no answer (stars) – ICMP rate limiting for DoS protection – Anonymous routers
• Do not send ICMP replies at all or reply with probe's destination IP
• Security and privacy concerns
• Tunnelling (e.g., MPLS) may hide routers – Routers inside the tunnel may not decrement TTL
Traceroute under load balancing
[Figure: a load balancer L forwards probes from monitor m toward target t along two branches (via A–B–C or via D–E). The TTL = 2 and TTL = 3 probes take different branches, so the inferred path has missing nodes and links and a false link]
Traceroute under load balancing (cont.)
• Even per-flow load balancing causes trouble • Traceroute uses the destination port as identifier
– Needs to match probe to response – Response only has the header of the issued probe
[Figure: traceroute's probes carry different destination ports (TTL = 2 with port 2, TTL = 3 with port 3), so a per-flow load balancer L hashes them to different branches]
Paris traceroute
• Solves the problem with per-flow load balancing – Probes to a destination belong to the same flow
• Keep flow IDs constant for probes to a specific destination – Flow ID = src/dst IP & port, transport protocol
• How to match probes with ICMP responses? – Need to know which ICMP response corresponds to which probe
Paris traceroute
• Matching probes with ICMP responses – Vary fields within the first eight octets of the transport-layer header (included in the ICMP response) – Keep the flow-ID related fields constant – UDP probes: vary checksum (need to manipulate payload too) – ICMP probes: vary #seq, but also Identifier -> keep checksum constant
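The checksum trick for ICMP probes can be illustrated with the standard ones'-complement Internet checksum (RFC 1071). The header values below are made up, and real Paris traceroute does more, but the invariant is the point: incrementing the sequence number while decrementing the identifier by the same amount leaves the checksum, and hence the flow ID, unchanged:

```python
# Ones'-complement Internet checksum over 16-bit words (RFC 1071 style).
def checksum16(words):
    s = sum(words)
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)   # fold carries back into 16 bits
    return (~s) & 0xFFFF

def icmp_words(ident, seq):
    # ICMP echo header words: type=8/code=0, checksum field as 0, id, seq.
    return [0x0800, 0x0000, ident, seq]

# Probe 2 increments seq and decrements ident by the same amount, so the
# ones'-complement sum (and thus the checksum) is unchanged.
c1 = checksum16(icmp_words(ident=1000, seq=1))
c2 = checksum16(icmp_words(ident=999, seq=2))
print(c1 == c2)  # True: a checksum-hashing load balancer sees the same flow
```

The (identifier, sequence) pair still differs between probes, so each ICMP response can be matched back to the probe that triggered it.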
[Figure: Paris traceroute probes through load balancer L keep port 1 (the flow ID) constant and vary the checksum (2, 3) to identify probes; the TTL = 2 and TTL = 3 probes now follow the same branch]
More traceroute shortcomings
• Inferred nodes = interfaces, not routers – Different interfaces have different IP address
• Coverage depends on monitors and targets – Misses links and routers – Some links and routers appear multiple times
[Figure: actual topology with monitors m1, m2 and targets t1, t2 connected through routers A, B, C, D vs. the inferred topology of interfaces A.1, B.3, C.1, C.2, D.1: some links and routers are missed, and router C appears twice (as C.1 and C.2)]
Alias resolution: Map interfaces to routers
• Direct probing – IP identifier (IPID) in IP header is usually an increasing per-packet (or jiffie) counter – Responses from the same router have close IPIDs and the same TTL
• Record-route IP option – Records only up to nine IP addresses of routers in the path
• Enough in many cases
– Some routers may drop packets with IP options
• Security concerns usually
– Can also discover outgoing interfaces
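The IPID heuristic for direct probing can be sketched as follows. The response tuples and the max_gap threshold are invented for illustration; real alias-resolution tools probe repeatedly and handle counter wrap-around more carefully:

```python
# Alias-resolution sketch: two interfaces likely belong to the same router
# if back-to-back probe responses show close IPIDs (a shared per-router
# counter) and the same TTL (same distance from the monitor).
def likely_alias(resp_a, resp_b, max_gap=10):
    (ipid_a, ttl_a), (ipid_b, ttl_b) = resp_a, resp_b
    close = (ipid_b - ipid_a) % 65536 <= max_gap   # counter moved only a little
    return close and ttl_a == ttl_b

print(likely_alias((4012, 250), (4015, 250)))  # True: same counter, same TTL
print(likely_alias((4012, 250), (9911, 248)))  # False: unrelated counters
```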
[Figure: in the inferred topology of interfaces A.1, B.3, C.1, C.2, D.1, alias resolution reveals that C.1 and C.2 belong to the same router]
Large-scale topology measurements
• Probing a large topology takes time – E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads) – Probing more targets covers more links – But, getting a topology snapshot takes longer
• Snapshot may be inaccurate – Paths may change during snapshot – To know that a path changed, need to re-probe
Large-scale topology measurements
• It is possible to reduce redundant probing – Topologies have tree like structures with aggregation points – Can skip redundant segments that are already discovered
B. Donnet et al.: “Efficient Algorithms for Large-Scale Topology Discovery”. SIGMETRICS 2005.
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Bandwidth measurements
• What? – Infer the bandwidth of a specific hop or of a whole path – Capacity = maximum possible throughput – Available bandwidth = portion of capacity not currently used – Bulk transfer capacity = throughput that a new single long-lived TCP connection could obtain
• Why? – Network aware applications
• Server or peer selection • Route selection in overlay networks
– QoS measurements
Challenges
• Routers and switches do not provide direct feedback to end-hosts – Except ICMP (traceroute) – Mostly due to scalability, policy, and simplicity reasons
• End-to-end bandwidth cannot be measured with SNMP – No access because of administrative barriers – Network administrators can query router/switch information only within their own network
The Internet as a “black box”
• End-systems can infer network state through end-to-end (e2e) measurements – Without any explicit feedback from routers – Objectives: accuracy, speed, minimal intrusiveness
Metrics and definitions
• Simple example of an end-to-end path
[Figure: source host – link1 (access link) – router1 – link2 – router2 – link3 (access link) – destination host, with cross traffic entering at router1 and router2]
Metrics and definitions (cont.)
• Capacity of this path is 100 Mbps – Determined by the narrow link
• Available bandwidth of this path is 50 Mbps – Determined by the tight link
narrow link = link1, tight link = link3

        link capacity   available bandwidth   used bandwidth
link1   100 Mbps        90 Mbps               10 Mbps
link2   2500 Mbps       1300 Mbps             1200 Mbps
link3   1000 Mbps       50 Mbps               950 Mbps
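In code, the path-level metrics reduce to minima over the links, using the figures from the example table above:

```python
# Path capacity = capacity of the narrow link (min link capacity);
# available bandwidth = available bandwidth of the tight link (min
# per-link available bandwidth). Values in Mbps from the example path.
links = {
    "link1": (100, 90),      # (capacity, available)
    "link2": (2500, 1300),
    "link3": (1000, 50),
}
capacity = min(c for c, _ in links.values())
avail_bw = min(a for _, a in links.values())
narrow = min(links, key=lambda l: links[l][0])
tight = min(links, key=lambda l: links[l][1])
print(capacity, narrow)   # 100 link1
print(avail_bw, tight)    # 50 link3
```

Note that the narrow and tight links need not be the same link, as here: link1 bounds capacity while link3 bounds available bandwidth.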
Measurement techniques
• Generally use active probing – Send packets with a specific inter-arrival pattern – Observe the pattern at the other end
• Example: Packet-pair technique for capacity estimation – Send two equal-sized packets back-to-back
• Packet size: L
• Packet tx time at link i: L/Ci
– P-P dispersion: time interval between the first bits of the two packets
– At each link i: Δout = max(Δin, L/Ci)
– Without any cross traffic, the dispersion at the receiver is determined by the narrow link: ΔR = max over i = 1..H of (L/Ci) = L/C, where C = path capacity
[Figure: an incoming packet pair with dispersion Δin traverses link i and leaves as an outgoing pair with dispersion Δout]
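The dispersion recursion above can be simulated directly. This sketch ignores cross traffic and uses the link capacities from the earlier example path:

```python
# Packet-pair sketch: each link transforms the pair's dispersion as
# dispersion_out = max(dispersion_in, L/C_i); without cross traffic the
# receiver-side dispersion is set by the narrow link, so C = L / dispersion.
L = 1500 * 8                          # probe packet size in bits
caps = [100e6, 2500e6, 1000e6]        # link capacities in bit/s (link1..link3)

disp = 0.0                            # packets sent back-to-back
for C_i in caps:
    disp = max(disp, L / C_i)         # dispersion after crossing link i

C_est = L / disp                      # estimated path capacity
print(round(C_est / 1e6))  # 100 -> the narrow link's capacity (link1)
```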
Bandwidth estimation with cross traffic
• Cross traffic packets can affect P-P dispersion – P-P expansion: capacity underestimation – P-P compression: capacity overestimation
• Noise in P-P distribution depends on cross traffic load
Expansion of Dispersion
• Cross-traffic (CT) serviced between PP packets
• Second packet queues due to cross traffic (CT) → Expansion of dispersion → Under-estimation of capacity
CapProbe
• CapProbe estimation tool takes cross-traffic into account
• Observations: – First packet queues more than the second • Compression → Over-estimation
– Second packet queues more than the first • Expansion → Under-estimation
– Both are the result of probe packets experiencing queuing • Sum of PP delay includes queuing delay
• Filter out PP samples that do not have minimum queuing time
• Dispersion of the PP sample with minimum delay sum reflects capacity
Rohit Kapoor et al.: CapProbe: A Simple and Accurate Capacity Estimation Technique. SIGCOMM ‘04
CapProbe approach
• For each packet pair, CapProbe calculates delay sum: delay(packet_1) + delay(packet_2)
• A PP with the minimum delay sum points out the capacity
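CapProbe's filtering step can be sketched as follows; the sample values are made up for illustration:

```python
# CapProbe filtering sketch: among packet-pair samples, the one with the
# minimum delay sum suffered (approximately) no queuing on either packet,
# so its dispersion reflects the path capacity.
def capprobe(samples, L):
    # samples: list of (delay_sum_seconds, dispersion_seconds)
    best = min(samples, key=lambda s: s[0])   # minimum delay sum
    return L / best[1]                        # capacity estimate in bit/s

L = 1500 * 8  # probe packet size in bits
samples = [
    (0.110, 0.0030),  # expanded: second packet queued
    (0.100, 0.0012),  # no queuing: minimum delay sum
    (0.105, 0.0006),  # compressed: first packet queued
]
print(round(capprobe(samples, L) / 1e6, 3))  # 10.0 (Mbit/s)
```

The expanded and compressed samples would under- and over-estimate capacity; filtering by delay sum discards both.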
Bandwidth estimation tools
• Many estimation tools & techniques – Abing, netest, pipechar, STAB, pathneck, IGI/PTR, abget, Spruce, pathchar, clink, pchar, PPrate, DSLprobe, ABwProbe, …
• Some practical issues – Traffic shapers – Non-FIFO queues
• More scalable methods – Passive measurements instead of active measurements
• E.g. PPrate (2006) for capacity estimation: adapt Pathrate's algorithm
– One measurement host instead of two cooperating ones
• abget (2006) for available bandwidth estimation
• DSLprobe for capacity estimation of asymmetric (ADSL) links
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Different viewpoints of the network
• Network operators only have data of one AS – AS4 doesn’t detect any problem – AS3 doesn’t know who is affected by the failure
• End-hosts don’t know what happens in the network – Can only monitor end-to-end paths
Network Tomography
• View network as “black box” – Probe network from the edge
• Fault diagnosis with tomography – Diagnose persistent reachability problems across domains – Useful to detect black holes
• Silent failures that do not produce alerts!
Network tomography: approach
• Two phase approach – Detect: End-to-end path monitoring – Localize: Binary tomography
• First detect whether there is a failure on a path – Continuous monitoring – Packet loss as indicator for failure
• Then, try to localize the failure – Figure out which path segment may be broken – Need topology information
Fault detection: end-to-end monitoring
• What is monitored? – Different properties of network links – Loss rate, delay, bandwidth, connectivity
• Who monitors? – Network operators
• In network monitoring hosts • Third-party monitoring services • From home gateways
– End users • Cooperative monitoring • Users of popular services/applications
Fault detection: end-to-end monitoring (cont.)
• How monitored? – End-to-end: from probe senders to collectors – No access to routers – Using multicast probes would be efficient
• IP multicast deployments limited in practice ☹
• Use unicast in practice
Monitoring techniques
• Active probing: ping – Send probe, collect response – From any end host
• Works for network operators and end users
• Passive analysis of user’s traffic – Tap incoming and outgoing traffic
• At user’s machines or servers: tcpdump, pcap • Inside the network: DAG card
– Monitor status of TCP connections
Passive fault detection
• At end hosts – tcpdump/pcap captures packets – Track status of each TCP connection
• RTTs, timeouts, retransmissions – Multiple timeouts indicate path is bad
• More challenging inside the network – Traffic volume is high
• Need special hardware such as DAG cards • May need to use sampling
– Tracking TCP connections is hard • May not capture both sides of a connection • Large processing and memory overhead
Active fault detection with ping
• If receives reply – Then, path is good
• If no reply before timeout – Then, path is bad
[Figure: monitor m sends an ICMP echo request probe to target t, which returns an ICMP echo reply]
Passive vs. active detection
Passive
+ No need to inject traffic
+ Detects all failures that affect user's traffic
+ Responses from targets that don't respond to ping
‒ Not always possible to tap user's traffic
‒ Only detects failures in paths with traffic

Active
+ No need to tap user's traffic
+ Detects failures in any desired path
+ Fast detection
‒ Probing overhead
‒ Deployment overhead
We focus on active probing in the rest of the slides
Persistent failure or measurement noise?
• Many reasons to lose probe or reply – Timeout may be too short – Rate limiting at routers – Some end-hosts don’t respond to ICMP request – Transient congestion – Routing change
• Need to confirm that failure is persistent – Otherwise, may trigger false alarms
Failure confirmation
• Upon detection of a failure, trigger extra probes
• Two main parameters – Number of probes
• Too large nb perturbs path and may cause additional losses
– Time between probes • Periodic? Which interval?
• Parameter values determine the robustness and reactivity – Make sure that there is indeed a persistent failure – Do not spend too long in confirming the failure
[Figure: a burst of lost packets on a path over time; confirming too hastily turns it into a false positive]
I. Cunha et al.: "Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography". IMC 2009.
Fault localization
• After detecting a failure, try to localize it – Need to know where intervention is required
• Binary tomography with multicast probes • Labels paths as good or bad
– Loss-rate estimation requires tight correlation – Instead, separate good/bad performance – If link is bad, all paths that cross the link are bad
• Find the smallest set of links that explains bad paths – Given bad links are uncommon – Bad link is the root of maximal bad subtree
[Figure: monitor m probes targets t1 and t2; a bad link near the root makes all paths crossing it bad, so the bad link is the root of the maximal bad subtree]
Binary tomography in practice
• Multiple sources and targets
• Iterative greedy algorithm – Given the set of links in bad paths – Iteratively choose the link that explains the max number of bad paths
• Topology may be unknown – Need to measure accurate topology
• Multicast not available – Need to extract correlation from unicast probes
[Figure: monitors m1, m2 and targets t1, t2]
Hitting set of a link = the paths that traverse the link
Greedy approximation
• Method based on spatial correlation – Intersect sets of OD-pairs that experienced failures to discover shared links – Shared links are the most likely explanation
• Called the hypothesis
• Apply greedy approximation approach – Can have single, dual, multiple failures – Unfeasible to explore all possibilities – Just look for the most likely explanation
[Figure: {G-H} as the most likely explanation for the black hole]
Greedy approximation (cont.)
• Failure detection generates a failure signature – Set of lost probes for all OD-pairs – A.k.a. reachability matrix
• MAX-COVERAGE algorithm – Pick the link that explains the largest number of observations – Add the link to the hypothesis – Remove the corresponding observations from the failure signature – Repeat iteratively until no observations remain in the failure signature
Greedy approximation: example
Failure signature (observations): A→D, F→C, B→E
Failed links (likely causes): A-G, G-H, H-D, F-G, H-C, H-E, B-G
MAX-COVERAGE picks the link explaining the most observations → Hypothesis: {G-H}
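The example above can be run through a small MAX-COVERAGE sketch; the per-path link sets are reconstructed from the example's failed links:

```python
# Greedy MAX-COVERAGE: repeatedly pick the link that explains the most
# remaining bad OD-pairs until every observation is covered.
def max_coverage(bad_paths):
    # bad_paths: dict mapping OD-pair -> set of links on that path
    hypothesis = []
    remaining = dict(bad_paths)
    while remaining:
        # Count how many remaining bad paths each link would explain.
        counts = {}
        for links in remaining.values():
            for link in links:
                counts[link] = counts.get(link, 0) + 1
        best = max(counts, key=counts.get)
        hypothesis.append(best)
        # Drop the observations that the chosen link explains.
        remaining = {od: ls for od, ls in remaining.items() if best not in ls}
    return hypothesis

bad = {
    ("A", "D"): {"A-G", "G-H", "H-D"},
    ("F", "C"): {"F-G", "G-H", "H-C"},
    ("B", "E"): {"B-G", "G-H", "H-E"},
}
print(max_coverage(bad))  # ['G-H']: one link explains all three bad paths
```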
Greedy approximation (cont.)
• May end up with many candidate links in the hypothesis – Likely with complex topologies
• How to select the candidate link within the hypothesis? – Pick links that explain at least n observations – Pick links that explain at least a fraction f of observations
• Pros & cons of greedy approximation – Avoids having to explore all possibilities – Can localize multiple failures – Biased in favor of links present in many paths
R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, "Detection and Localization of Network Blackholes", IEEE INFOCOM, 2007.
Problem of synchronization
• There is no perfect synchronization between measurements – Probes cross links at different times – Path may change between probes – Leads to inconsistencies
• Measurements from a single monitor – Probing all targets can take time
• Measurements from multiple monitors – Impossible to synchronize probes from different monitors to reach each shared link at the same time
[Figure: monitor m probes targets t1 and t2; unsynchronized probes lead to a mistakenly inferred failure on the shared link]
Problem of synchronization (cont.)
[Figure: reachability matrix of paths from monitors m1…mK to targets t1…tN; e.g. m1→t1 good, …, mK→tN bad]
• Inconsistent measurements: we don't know whether the failed link is on a shared path segment or only on the mK→tN path!
Aggregation: Basic idea
• Consistency has a cost – Delays fault localization – Cannot identify short failures
[Figure: same reachability matrix, now with both m1→tN and mK→tN bad, so the measurements are consistent]
• Reprobe paths after a failure is detected
Aggregation: tradeoff
• Consistency vs. localization speed – Faster localization leads to false alarms
• Short failures such as transient congestion
– Slower localization increases repair time and may miss shorter failures – Different aggregation strategies impact this tradeoff
• Wait for one full probing cycle or multiple cycles?
• Different viewpoints – Network operators
• Too many false alarms are unmanageable
• Longer failures are the ones that need intervention
– End users • Even short failures affect performance
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application-level analysis – Traffic anomaly detection
• Conclusions
Traffic analysis
• Passive measurements – Record traffic traces (and other logs) – Infer something from the traces and logs
• Typical analysis techniques include – Statistical methods – Machine learning – Protocol reverse engineering
• Scalability often important – Large traces ensure statistically significant observations
• Three different examples – Root cause analysis – Application-level analysis – Anomaly detection
Root cause analysis of TCP traffic
• Focus on TCP traffic – Transports >90% of all Internet traffic
• What is root cause analysis (RCA)? – Find the origin of an observed phenomenon → the root cause
• Most important performance metrics – Throughput
• Bits per second from application to application carried by TCP
– Response time • Delay between sending a request and receiving the last bit of the response • E.g. Google search
Motivation
• ISPs want to know what kind of QoS clients get and why
• What are the dominant performance limitations that Internet applications are facing?
• Why does a client with 40 Mbps broadband access get only a total download rate of a few Mbps using P2P?
• Why does it take a long time to load a web page?
• The network provides few answers directly – The network elements are by design not intelligent
General approach
• Goal: Infer reasons that – prevent a given TCP connection from achieving a higher throughput – cause major portions of response time
• Passive measurements – Capture and store all TCP/IP headers – Analyse all individual connections later off-line
• Measure traffic at a single point – Applicable in diverse situations – E.g. at the edge of an ISP's network
• Know all about clients' downloads and uploads
General approach (cont.)
1. Estimate values of a set of parameters – Estimates extracted from packet headers – E.g. round trip time (RTT)
2. Compute a set of metrics related to performance – Based on estimated parameter values
3. Infer dominant root causes (i.e. limitation causes) – Based on combination of performance metrics
Challenges (1/3)
• Single measurement point anywhere along the path – Cannot/don’t want to control it – Complicates estimation of parameters (RTT and cwnd)
[Figure: measurement point placed at A or B along the path between sender and receiver. At A: RTT ≈ d1 ⇒ piece of cake… At B: RTT ≈ d3 + d4 ⇒ how to get d4? (Did ack2 trigger data2?)]
Challenges (2/3)
• A lot of data to analyze – Potentially millions of connections per trace
• Deep analysis – Compute a lot of metrics for each connection of each trace – Need to keep track of everything
Need solutions for data management → e.g. DBMS
Challenges (3/3)
• Find the right metrics to characterize all limitations – Lots of empirical studying
• Get it right! – We have no ground truth when measuring in the wild! – Careful validations
• Benchmark with a lot of reference traces • Cross validate metrics
What can limit TCP throughput?
• Application • Transport layer
– TCP receiver • Receiver window limitation
– TCP protocol • Congestion avoidance mechanism…
• Network layer – Bottleneck link
[Figure: throughput over time of an application that sends large bursts separated by idle periods (e.g. BitTorrent, HTTP/1.1 persistent connections, also streaming): transfer periods alternate with idle periods carrying only keep-alive messages]
Limitation Causes: Application
• The application does not even attempt to use all network resources
• TCP connections are partitioned into two periods: – Bulk Transfer Period (BTP): application constantly provides data to transfer
• Never runs out of data in buffer B1
– Application Limited Period (ALP): opposite of BTP
• Sending TCP is idle because B1 is empty
[Figure: sender and receiver applications on top of TCP and the network; B1 is the buffer between the sending application and TCP]
Limitation Causes: TCP Receiver
• Receiver advertised window limits the rate – Max amount of outstanding bytes = min(cwnd, rwnd) – Sender is idle waiting for ACKs to arrive
• Flow control – Sending application outpaces the receiving application – Buffer B2 is full
• Configuration problem (unintentional) – Default receiver advertised window is set too low – Window scaling is not enabled
[Figure: as above, but B2 is the buffer between TCP and the receiving application]
Limitation Causes: TCP protocol
• Slow start and congestion avoidance increase the sending rate until hitting the limit set by the network – Ramp up takes some time
• Especially short connections are often limited by the ramp-up time
• Depends on – Initial congestion window – TCP protocol version (NewReno, Cubic, CTCP…)
Limitation Causes: Network
• Limitation is due to congestion at a bottleneck link • Shared bottleneck
– Obtain only a fraction of its capacity
• Non-shared bottleneck – Obtain all of its capacity
What contributes to response time?
• All the same limitation causes as with throughput – Low throughput inflates overall response time – Time spent on transferring requests and responses
• End points add to response time – Service takes time to craft responses
• E.g. backend map-reduce system for query processing
– Overloaded server will inflate the response time
• Client requests need to wait for service
• 93
[Figure: response time of a search query; typing key words, query processing and result preparation, and adapting the search page all add to the response time, on top of any throughput limitation]
How to do the root cause analysis?
• We need – Necessary parameter estimates – Metrics to use and a way to compute them – Algorithms for inferring root causes from the metrics
• We’ll look at how to find TCP throughput limitation causes as an example case
RCA of TCP throughput: One approach
1. Partition connections into BTPs and ALPs
– If the sending application limits throughput, TCP does not even try to reach the limits of the network or the receiver → network- or transport-level limitation not possible
– First need to isolate and filter out such cases
2. Analyze the BTPs for limitation by – TCP receiver – TCP protocol – Network
• Methods are based on metrics computed from packet headers
First step: Filter out application limited periods
• Fact: TCP always tries to send MSS-size packets → small packets (size < MSS) and idle time indicate application limitation
• The buffer between application and TCP is empty
[Figure: packet timeline of one connection; idle times longer than an RTT and stretches with a large fraction of packets smaller than MSS mark ALPs, the remaining periods are BTPs]
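The filtering rule above (idle time > RTT, packets smaller than MSS) can be sketched as a simple per-packet labeling pass. This is a toy illustration with assumed MSS and RTT values, not the actual analysis tool:

```python
MSS = 1460   # assumed sender maximum segment size, in bytes
RTT = 0.05   # assumed round-trip time estimate, in seconds

def split_btp_alp(packets):
    """Partition one flow's data packets into Bulk Transfer Periods (BTP)
    and Application Limited Periods (ALP).

    packets: list of (timestamp, size) tuples in time order.
    A packet extends an ALP if it is smaller than MSS or preceded by an
    idle gap longer than one RTT; otherwise it extends a BTP.
    Returns a list of [label, packet_list] periods.
    """
    periods = []
    prev_t = None
    for t, size in packets:
        idle = prev_t is not None and (t - prev_t) > RTT
        label = "ALP" if (size < MSS or idle) else "BTP"
        if periods and periods[-1][0] == label:
            periods[-1][1].append((t, size))   # extend current period
        else:
            periods.append([label, [(t, size)]])  # start a new period
        prev_t = t
    return periods
```

Real implementations use a fraction-of-small-packets criterion over a window rather than labeling every packet individually; this sketch only illustrates the two signals (small packets and idle time).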
Second step: BTP Analysis
1. Compute limitation scores for each BTP – These are the metrics – 4 quantitative scores between zero and one – Parameters used to compute metrics:
• retransmission rate, inter-arrival time pattern, path capacity, RTT…
2. Perform classification of BTPs into limitation causes – Map (combination of) limitation scores into a cause – Threshold-based decision tree
Classification scheme
• Here we infer the root cause
• 4 thresholds need to be calibrated
Burstiness score: How bursty is the traffic (does the sender spend time waiting for ACKs)?
Dispersion score: How close to path capacity is the throughput?
Retransmission score: How high is packet loss?
Receiver window limitation score: How close to receiver set limit is the TCP sending window?
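A threshold-based decision tree over these four scores could be sketched as follows. The thresholds and the ordering of the tests are illustrative assumptions, not the calibrated values from the actual study:

```python
def classify_btp(burstiness, dispersion, retransmission, rwnd_limit,
                 t_b=0.5, t_d=0.8, t_r=0.02, t_w=0.9):
    """Map four limitation scores (each in [0, 1]) to a root cause with a
    simple threshold-based decision tree. The t_* thresholds are
    hypothetical and would need calibration against reference traces."""
    if rwnd_limit > t_w:
        return "receiver limited"      # sending window pinned at rwnd
    if retransmission > t_r or dispersion > t_d:
        return "network limited"       # losses, or running at path capacity
    if burstiness > t_b:
        return "transport limited"     # sender idles waiting for ACKs
    return "unknown"

print(classify_btp(0.2, 0.95, 0.05, 0.1))  # network limited
```

The point of the sketch is the structure: each BTP's score vector is pushed through a fixed sequence of comparisons, so all the inference difficulty moves into calibrating the thresholds.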
Calibrating the classification thresholds
• Difficult task: Diversity vs. Control
– Reference data needs to be representative & diverse enough
– Need to control experiments in some way to get what we want
• Try to generate transfers limited by a certain cause
– FTP downloads from Fedora Core mirror sites • 232 sites covering all continents
– Artificial bottleneck links with rshaper • network limitation
– Nistnet to add delay • receiver limitation (Wr/RTT < bw)
– Control the number of simultaneous downloads • unshared vs. shared bottleneck
[Figure: calibration setup; FTP downloads from mirror sites in Australia, Japan, Finland, the USA, and France traverse an Rshaper bottleneck and a Nistnet delay emulator]
Example analysis of real network traffic
• Applied RCA on customer traffic of an ADSL access network – 24 hours of traffic on March 10, 2006
• 290 GB of TCP traffic – 64% downstream, 36% upstream
• Analyzed traffic from 1335 clients
[Figure: two pcap probes collect traffic between the Internet and the ADSL access network]
What did we learn?
• Most clients’ performance limited by applications • Very low link utilizations for application limited traffic • Most of application limited traffic seems to be P2P
– Peers often have asymmetric uplink and downlink capacities – P2P applications/users enforce upload rate limits
⇒ Most clients' download performance suffers from P2P clients' upload rate limiters
[Figure: a downloading client sees low link utilization because the uploading peers have low capacity and upload rate limiters]
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application measurements – Traffic anomaly detection
• Conclusions
Application measurements
• Lots of work done to measure specific applications • Typically focus shifts with the dominant applications
– Early days: Web browsing and email – 5-10 years ago: P2P – Now: video streaming and social networking
• Why?
– Detect their usage • Block, give low/high priority
– Understand how users perceive the service • E.g. does the video stream break?
• Let’s look at some examples…
P2P traffic identification
• Need to identify it before it can be characterized… • Enforcing regulations and rules
• P2P uses TCP ports of other applications (e.g. 80) – Circumvent firewalls and "hide" from authorities
• Identification by well-known TCP ports
☺ Fast and simple
☹ May capture only a fraction of the total P2P traffic
• Search application-specific keywords from packet payloads
☺ Generally very accurate
☹ A set of legal, privacy, technical, logistic, and financial obstacles
☹ Need to reverse engineer poorly documented P2P protocols
☹ Payload encryption in P2P protocols (e.g. some BitTorrent clients)
P2P traffic identification (cont.)
• Transport layer connection patterns – Observe connection patterns of source and destination IPs
☺ Can give very good results
☹ Limited by knowledge of the existing connection patterns
• "Early identification" – Observe size and direction of the first few packets of a connection – Works also for some encrypted traffic (SSL)
☺ Robust: identify > 90% of unencrypted and > 85% of encrypted connections
☺ Simple and fast
☹ Need to train the system offline
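The port-based baseline above is a one-line check per flow. A minimal sketch; the port set is a small illustrative sample of well-known P2P defaults, not an exhaustive list:

```python
# Well-known default ports of some P2P protocols (illustrative sample only).
P2P_PORTS = {
    6881, 6882, 6883, 6884, 6885, 6886, 6887, 6888, 6889,  # BitTorrent
    6346, 6347,                                            # Gnutella
    4662,                                                  # eDonkey
}

def is_p2p_by_port(src_port: int, dst_port: int) -> bool:
    """Port-based heuristic: fast and simple, but misses P2P traffic
    that hides on ports such as 80, and may mislabel other apps."""
    return src_port in P2P_PORTS or dst_port in P2P_PORTS

print(is_p2p_by_port(51234, 6881))  # True
print(is_p2p_by_port(51234, 80))    # False: P2P hiding on port 80 is missed
```

The second call illustrates exactly the ☹ above: any P2P client that picks port 80 evades this classifier entirely.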
Measuring YouTube
• Most popular video streaming site on the Internet – Huge number of users, amount of user-generated content, traffic volumes
• Many studies have been conducted
– Some investigate the service from the end user's perspective • What kind of service do I get and why?
– Others look at the ISP's viewpoint • From where does YouTube serve my clients? • What kind of traffic does it generate to my network?
– Yet others study the service provider's perspective • Which videos are watched?
– Different kinds of measurement data • End user measurements ("easy") • Crawling YouTube-provided statistics ("easy") • Traffic collected within ISP network (need ISP to collaborate)
Example YouTube traffic analysis study
• We look at one example investigation
– A. Finamore et al.: "YouTube everywhere: impact of device and infrastructure synergies on user experience." IMC 2011.
• Methodology – Collected traffic from several vantage points – Traffic classification and analysis using Tstat – Per-connection TCP statistics – Deep packet inspection to parse HTTP messages
• Classify the type of content and device • Identify the "control" messages • Per-video statistics (video duration, resolution, codec, ...)
Data sets
• Week-long tracing in Sep. 2010
• 5 vantage points in Europe and the US
– 4 access technologies: ADSL, FTTH, Ethernet, WiFi – Both residential ISPs and campus networks – Mobile players access YouTube via WiFi (no 3G/4G)
• Now, let’s look at some results…
Fraction of video downloaded
• About 80% of sessions are aborted early by users – 60% of aborted videos are watched for less than 20% of their duration
• iPhone’s YouTube player downloads extra bytes – Not sure about the reason…
[Figure: distribution of the fraction of video downloaded (= downloaded bytes / video size); most sessions download only a portion of the video, while some download more than the entire video]
How many bytes are wasted?
• 20% of aborted mobile streaming sessions downloaded more than 5 times what could be played
• PC users wasted fewer bytes – Mobile players have more aggressive buffering policies
• Overall waste of data during peak hours – PC players: 39% – Mobile players: 47%
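The waste numbers above come from straightforward per-session arithmetic. A small sketch, assuming downloaded bytes, video size, and played bytes are known per session (the sample values are made up):

```python
def downloaded_fraction(dl_bytes: int, video_bytes: int) -> float:
    """Fraction of video downloaded = downloaded bytes / video size.
    Values > 1 mean the player downloaded more than the whole video."""
    return dl_bytes / video_bytes

def wasted_bytes(dl_bytes: int, played_bytes: int) -> int:
    """Bytes downloaded but never played (e.g. in an aborted session)."""
    return max(0, dl_bytes - played_bytes)

# Hypothetical aborted session: 50 MB video, 40 MB downloaded, 10 MB played.
dl, size, played = 40_000_000, 50_000_000, 10_000_000
print(downloaded_fraction(dl, size))  # 0.8 of the video was downloaded
print(wasted_bytes(dl, played))       # 30000000 bytes never played
```

Aggregating `wasted_bytes` over all sessions in the peak hour and dividing by total downloaded bytes gives overall waste figures like the 39%/47% quoted above.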
Startup latency
• Startup latency: time elapsed between the video request and the first data packet
– Lower bound of what the user experiences (initial buffering time is not known)
• More than 10% of the requests are redirected to another server – 10% more likely for mobile players – Large impact on startup latency
What did we learn?
• PC and mobile players behave differently • Players could be optimized to limit data waste
– Figure out more optimal scheduling of content download – Especially mobile players
• data plan quota and energy waste • Downloading >100% of video content…
• Cache selection/load balancing could perhaps be optimized – Reduce redirections
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application measurements – Traffic anomaly detection
• Conclusions
Anomaly detection
• Study abnormal traffic
– Non-productive traffic, a.k.a. Internet "background radiation"
– Traffic that is malicious (scans for vulnerabilities, worms) or mostly harmless (misconfigured devices)
• Network troubleshooting
– Identify and locate misconfigured or compromised devices
• Intrusion detection
– Identify malicious activity before it hits you
– Analyze traffic for attack signatures
• Characterizing malicious activities
– Honeypot: a host that sits and waits for attacks • Learn how attackers probe for and exploit a system
– Network telescope: portion of routed IP address space on which little or no legitimate traffic exists
Anomaly detection techniques
• Typical procedure
– Learn underlying models of normal traffic • May require a long time (days or weeks)
– Detect deviations from the normal
• Several techniques exist, based on statistical analysis – PCA: Principal Component Analysis – Kalman filter – Wavelets – …
• Some techniques require little or no training – ASTUTE
Outline
• What is QoS? – Overview of QoS mechanisms
• Network diagnostics and traffic analysis – What, why, and how?
• Measuring networks – Topology discovery – Bandwidth measurements – Network Tomography
• Traffic analysis – Root cause analysis – Application measurements – Traffic anomaly detection
• Conclusions
Wrapping up
• Diagnosing networks may be quite challenging – Complexity – Scale – Network elements are dumb
• Hence, solutions can be quite sophisticated
– Simply analyzing alarms and triggers may be time consuming and allows faults to "fly under the radar" • False positives, black holes, complex dependencies, …
– Statistical analysis, data mining
– Combine many sources of data, even "low" quality • Whatever is available
• Two viewpoints:
– Measuring network properties • Usually apply active probing
– Measuring traffic • Passive measurements
Want to know more?
1. V. Jacobson, "Pathchar: A Tool to Infer Characteristics of Internet Paths," ftp://ftp.ee.lbl.gov/pathchar/, Apr. 1997.
2. A. B. Downey, "Using Pathchar to Estimate Internet Link Characteristics," in Proceedings of ACM SIGCOMM, Sept. 1999, pp. 222–223.
3. K. Lai and M. Baker, "Measuring Link Bandwidths Using a Deterministic Model of Packet Delay," in Proceedings of ACM SIGCOMM, Sept. 2000, pp. 283–294.
4. A. Pasztor and D. Veitch, "Active Probing using Packet Quartets," in Proceedings of the Internet Measurement Workshop (IMW), 2002.
5. R. S. Prasad, M. Murray, C. Dovrolis, and K. Claffy, "Bandwidth estimation: metrics, measurement techniques, and tools," IEEE Network, November/December 2003.
6. S. Keshav, "A Control-Theoretic Approach to Flow Control," in Proceedings of ACM SIGCOMM, Sept. 1991, pp. 3–15.
7. R. L. Carter and M. E. Crovella, "Measuring Bottleneck Link Speed in Packet-Switched Networks," Performance Evaluation, vol. 27–28, pp. 297–318, 1996.
8. V. Paxson, "End-to-End Internet Packet Dynamics," IEEE/ACM Transactions on Networking, vol. 7, no. 3, pp. 277–292, June 1999.
9. M. Jain and C. Dovrolis, "End-to-End Available Bandwidth: Measurement Methodology, Dynamics, and Relation with TCP Throughput," in Proceedings of ACM SIGCOMM, Aug. 2002, pp. 295–308.
10. B. Melander, M. Bjorkman, and P. Gunningberg, "Regression-Based Available Bandwidth Measurements," in International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2002.
11. V. Ribeiro, R. Riedi, R. Baraniuk, J. Navratil, and L. Cottrell, "pathChirp: Efficient Available Bandwidth Estimation for Network Paths," in Proceedings of the Passive and Active Measurements (PAM) workshop, Apr. 2003.
12. N. Hu and P. Steenkiste, "Evaluation and Characterization of Available Bandwidth Probing Techniques," IEEE Journal on Selected Areas in Communications, 2003.
13. K. Harfoush, A. Bestavros, and J. Byers, "Measuring Bottleneck Bandwidth of Targeted Path Segments," in Proceedings of IEEE INFOCOM, 2003.
14. M. Allman, "Measuring End-to-End Bulk Transfer Capacity," in Proceedings of the ACM SIGCOMM Internet Measurement Workshop, Nov. 2001, pp. 139–143.
15. L. Lao, M. Y. Sanadidi, and C. Dovrolis, "The Probe Gap Model can Underestimate the Available Bandwidth of Multihop Paths," ACM SIGCOMM Computer Communications Review, October 2006.
16. D. Antoniades, M. Athanatos, A. Papadogiannakis, E. P. Markatos, and C. Dovrolis, "Available bandwidth measurement as simple as running wget," Passive and Active Measurements (PAM) conference, March 2006.
17. Rohit Kapoor et al., "CapProbe: A Simple and Accurate Capacity Estimation Technique," SIGCOMM 2004.
18. D. Croce, T. En-Najjary, G. Urvoy-Keller, and E. Biersack, "Capacity Estimation of ADSL links," Proc. of the 4th ACM CoNEXT conference, December 2008.
Bandwidth measurements
Want to know more?
1. Pietro Marchetta, Pascal Mérindol, Benoit Donnet, Antonio Pescapé and Jean-Jacques Pansiot. Topology Discovery at the Router Level: A New Hybrid Tool Targeting ISP Networks. IEEE Journal on Selected Areas in Communication, Special Issue on Measurement of Internet Topologies, 2011. to appear.
2. Jean-Jacques Pansiot, Pascal Mérindol, Benoit Donnet and Olivier Bonaventure. Extracting Intra-Domain Topology from mrinfo Probing. In Arvind Krishnamurthy and Bernhard Plattner, editor, Proc. Passive and Active Measurement Conference (PAM), pages 81-90, April 2010. Springer Verlag.
3. B. Donnet, P. Raoult, T. Friedman and M. Crovella. Deployment of an Algorithm for Large-Scale Topology Discovery. IEEE Journal on Selected Areas in Communications, Sampling the Internet: Techniques and Applications, 24(12):2210-2220, Dec. 2006.
4. Y. Bejerano, “Taking the Skeletons Out of the Closets: A Simple and Efficient Topology Discovery Scheme for Large Ethernet,” Proc. IEEE INFOCOM, Apr. 2006.
5. Z. M. Mao et al., “Scalable and Accurate Identification of ASlevel Forwarding Paths,” Proc. IEEE INFOCOM, Mar. 2004.
6. P. Mahadevan et al., “The Internet AS-Level Topology: Three Data Sources and One Definitive Metric” ACM SIGCOMM Computer Commun. Rev iew, vol . 36, no. 1, Jan. 2006, pp. 17–26.
7. B. Donnet and T. Friedman. Internet Topology Discovery: a Survey. IEEE Communications Surveys and Tutorials, 9(4):2-15, December 2007.
Topology discovery
Want to know more?
1. R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, "Network Tomography: Recent Developments," Statistical Science, Vol. 19, No. 3 (2004), 499-517.
2. A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, 2000.
3. N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.
4. R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
5. A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser:Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.
6. I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement Methods for Fast and Accurate Blackhole Identification with Binary Tomography”, IMC, 2009.
7. E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.
8. M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.
9. H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, "Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis," INFOCOM, 2009.
Network Tomography
Want to know more?
1. Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. 2009. Detailed diagnosis in enterprise networks. In Proceedings of the ACM SIGCOMM 2009.
2. Ajay Anil Mahimkar, Han Hee Song, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, and Joanne Emmons. Detecting the performance impact of upgrades in large operational networks. In ACM SIGCOMM 2010.
3. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California fault lines: understanding the causes and impact of network failures. In ACM SIGCOMM 2010.
Detailed diagnostics
Anomaly detection
1. Fernando Silveira, Christophe Diot, Nina Taft, and Ramesh Govindan. 2010. ASTUTE: detecting a different class of traffic anomalies. In Proceedings of the ACM SIGCOMM 2010.
2. Yin Zhang, Matthew Roughan, Walter Willinger, and Lili Qiu. 2009. Spatio-temporal compressive sensing and internet traffic matrices. In Proceedings of the ACM SIGCOMM 2009.
3. Anukool Lakhina, Mark Crovella, and Christophe Diot. 2005. Mining anomalies using traffic feature distributions. In Proceedings of SIGCOMM 2005.
4. Haakon Ringberg, Augustin Soule, Jennifer Rexford, and Christophe Diot. 2007. Sensitivity of PCA for traffic anomaly detection. In Proceedings of the 2007 ACM SIGMETRICS.