1
Berkeley-Helsinki Summer Course
Lecture #7: Network Measurement and
MonitoringRandy H. Katz
Computer Science Division
Electrical Engineering and Computer Science Department
University of California
Berkeley, CA 94720-1776
2
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
3
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
4
Measuring/Characterizing Web Traffic
• Motivation for Measurement
– Insights into Web site design
– Managing proxies and servers
– Operating IP networks
• Measurement Process
– Monitoring from some network location
– Generate measurement records in some format
– Preprocessing for subsequent analysis
• Based on Chapter 9, “Web Traffic Measurement,” in Web Protocols and Practice, Krishnamurthy and Rexford, Addison Wesley, Reading, MA, 2001.
5
Web Measurement
• Content Creators
– Measurements of user browsing patterns
» Number of visitors, site stickiness influence advertising revenue
» Optimize for common user sequences
» User-perceived latency influences server and placement decisions
• Web Hosting Company
– Number of response messages/bytes served influences load balancing strategy among multiple hosted sites
» Mix of busy-day sites/busy-night sites
» Managing persistent connections
– Resource usage influences billing
» When to introduce more servers, better connectivity
6
Web Measurement
• Network Operators
– Resource decisions: where to add bandwidth, when to upgrade links, where to place proxies and caches, how to modify routing within the provider cloud, etc.
– User community: relative mix of clients with low vs. high bandwidth connectivity
• Web/Networking Researchers
– Evaluating performance of protocols and software
– Drive evolution of protocols, policies, algorithms
– Better understanding of Internet traffic dynamics
7
Measurement Techniques
• Server Logging
– Log entry per HTTP request
– Requesting client
» Could be a user, a proxy, or a cache (the latter two represent aggregated patterns)
» Identified by an IP address
• Could represent the workload of multiple users
• Dynamically assigned addresses not correlated with same user each time encountered
– Request time
– Request/response messages
– Coarse-grained, aggregated times
– NOTE: proxy/cache-satisfied requests are filtered before reaching the server
– Hard to obtain!
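The per-request fields above are what a typical access-log line carries; a minimal parsing sketch in Python (the Common Log Format regex and the sample line are illustrative, not any particular server's exact output):

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_entry(line):
    """Return a dict of log fields, or None if the line doesn't match."""
    m = CLF.match(line)
    if not m:
        return None
    d = m.groupdict()
    d["bytes"] = 0 if d["bytes"] == "-" else int(d["bytes"])
    d["status"] = int(d["status"])
    return d

entry = parse_entry(
    '10.0.0.1 - - [10/Jul/2001:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 4523'
)
```

Note that, as the slide says, the timestamp is coarse-grained (one-second resolution) and the client host may hide many users behind a proxy.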
8
Measurement Techniques
• Proxy Logging
– Proxies can be associated with clients or servers, e.g., proxy for UC Berkeley vs. proxy for Google
– Former provides insights into client behavior aggregated by administrative domain; more detailed information about individual clients may be available
– Degree of aggregation depends on how close the proxy is to clients (close implies small community, far implies large community)
– Limited scope, accesses filtered by browser caches
– Hard to obtain!
9
Measurement Techniques
• Packet Monitoring
– Network-level logging (HTTP, IP, TCP)
– Fine-grained timestamping possible
– Some requests satisfied from client caches; encrypted packets could present collection difficulties
– Monitor needs to be placed so as to be able to eavesdrop on packets
10
Measurement Techniques
• Active Measurement
– Generate requests in a controlled manner, observe their performance
– Issues:
» Where to locate the modified user agents: geographical placement, quality of connectivity to wide-area network
» What requests to generate: e.g., based on profile of popular web sites
» What measurements to collect: DNS queries, TCP timeouts; proxy interception makes it difficult to distinguish sources of latency
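One way to separate the latency sources mentioned above is to time each step of a probe individually; a hedged sketch (the throwaway local listener only makes the example self-contained):

```python
import socket
import time

def probe(host, port):
    """Time DNS resolution and TCP connect separately, so the two
    latency sources (DNS queries vs. TCP setup) can be told apart."""
    t0 = time.perf_counter()
    addr = socket.gethostbyname(host)            # DNS lookup
    t1 = time.perf_counter()
    with socket.create_connection((addr, port), timeout=5):
        t2 = time.perf_counter()                 # TCP handshake done
    return {"dns_ms": (t1 - t0) * 1e3, "connect_ms": (t2 - t1) * 1e3}

# demo against a throwaway local listener so the sketch is runnable offline
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
sample = probe("localhost", srv.getsockname()[1])
srv.close()
```

A real measurement agent would add the HTTP transfer time as a third component and repeat probes to get stable statistics.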
11
Inferences from Measurement Data
• Limitation of HTTP Header Information
– Incomplete header logging
– Heuristics needed to reconstruct behavior from log
• Ambiguous Client/Server Identity
– Client identity/unique IP address
– Many IP addresses associated with same server
• Inferring User Actions
– Difficult to correlate user-level actions like mouse clicks with observed network activity
– One click can generate many HTTP requests
• Detecting Resource Modifications
– Web-level actions typically miss modifications
– Incomplete use of Last-Modified and Date fields by servers
12
Web Workload Characterization
• Applications of Workload Models
– Identifying performance problems
» High latency/low throughput under specific load scenarios
– Benchmarking Web components
» Selecting among competing architectures
– Capacity planning
» "Right sizing" network bandwidth, CPU, disk, memory given expected loads
• Workload Parameters
– Protocols: Request method/Response code
– Resources: Content type, Resource size, Response size, Popularity, Modification frequency, Temporal locality, Number of embedded resources
– Users: Session interarrival times, Number of clicks per session, Request interarrival times
13
Workload Characteristics
• HTTP Requests/Responses
– GET method predominates, small number of POSTs (forms), OK responses
– More intelligent protocols for communicating with caches may change distribution of requests (e.g., HEAD)
• Web Resources
– Text and images dominate, increasing audio/video content
– Small resource size dominates; average HTML file size is 4-8 KB, image file size 14 KB; wide variation around the mean implies Pareto distribution ("heavy tailed")
– Higher b/w connections imply larger web objects over time
• Response Sizes
– Users likely to abort large transfers, so median response size smaller than median resource size; very heavy tail
– Effect of higher b/w connections on response size?
14
Workload Characteristics
• Resource Popularity
– Zipf's Law: a small number of objects are highly popular
– Effectiveness of caching at all levels (client browser cache, site proxy cache, even DNS name cache)
• Resource Changes
– Static content vs. script-based descriptions
– Periodic changes ("young die young")
• Temporal Locality
– Correlated access to resources in time
• Embedded Resources
– Web pages have median of 8-20 embedded resources, heavy-tailed distribution
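Zipf-skewed popularity is exactly why small caches work well; a toy sketch, assuming a 1/k^s popularity law and a cache pinned to the 50 most popular of 1000 objects (all parameter values are illustrative):

```python
import random

def zipf_sample(n_objects, s, rng):
    """Draw one object index with P(rank k) proportional to 1/k^s."""
    weights = [1.0 / (k ** s) for k in range(1, n_objects + 1)]
    return rng.choices(range(n_objects), weights=weights, k=1)[0]

rng = random.Random(7)
requests = [zipf_sample(1000, 1.0, rng) for _ in range(5000)]

# Hit rate of a cache holding only the 50 most popular of 1000 objects:
# Zipf skew means a tiny cache captures a large share of the requests.
cache = set(range(50))
hits = sum(1 for r in requests if r in cache)
hit_rate = hits / len(requests)
```

With s = 1, the expected hit rate is H(50)/H(1000), roughly 60%, even though the cache holds only 5% of the objects.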
15
Workload Characteristics
• User Behavior
– Session and request arrivals
» Infer session via repeated access to same server
» Burst of HTTP requests, think time
– Clicks per session
» 4-10 clicks on average; distinguish between "sticky" sites and directory/redirection sites
» Heavy user vs. light user
– Request interarrival times
» Activity punctuated with think times
» Request interarrivals on the order of 60 seconds
16
Research Perspectives on Measurement
• Packet monitoring of HTTP traffic
• Analyzing Web server logs
• Publicly available logs and traces
• Measuring multimedia streams
17
Packet Monitoring of HTTP Traffic
• Tapping a link carrying IP packets
• Capturing packets from HTTP transfers
• Demuxing packets into TCP connections
• Reconstructing ordered stream of bytes
• Extracting HTTP messages from byte stream
• Generating a log of HTTP messages
18
Analyzing Web Server Logs
• Parsing and Filtering
– Logs in multiple formats
– Interleaved log records
– Timestamp diversity
• Transforming
– Remove erroneous records
– Diverse formats for URLs; conversion to unique integers for easier processing
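The URL-to-integer transformation above can be sketched as a canonicalize-then-intern table (the canonicalization rules here, lowercasing the host and keeping the path, are illustrative; a real pipeline would also handle default ports, %-escapes, etc.):

```python
from urllib.parse import urlsplit

class UrlTable:
    """Map canonicalized URLs to small integer IDs, so later analysis
    can work on integers instead of diversely formatted URL strings."""
    def __init__(self):
        self.ids = {}

    def canonical(self, url):
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        path = parts.path or "/"
        return host + path

    def to_id(self, url):
        key = self.canonical(url)
        # assign the next integer the first time a canonical URL is seen
        return self.ids.setdefault(key, len(self.ids))

t = UrlTable()
a = t.to_id("http://WWW.Example.com/index.html")
b = t.to_id("http://www.example.com/index.html")   # same ID as a
```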
19
Publicly Available Logs and Traces
• Internet Traffic Archive– http://www.acm.org/sigcomm/ita
• World Wide Web Consortium’s Web Characterization Group Repository
– http://www.purl.org/net/repository
• NLANR– http://ircache.nlanr.net/Cache/
• CAnet Squid logs– http://ardnoc41.canet2.net/cache/
20
Measuring Multimedia Streams
• Static analysis of multimedia resources
– Locating video content at various web sites
– Acquiring copies
– Computing statistics
• Multimedia server logs
– VCR-like operations
– User access patterns, frequency of early abort
• Packet monitoring of multimedia streams
– Infer session identity from src/dst IP address, port #, protocol
• Multilayer packet monitoring
– Correlation of control and data streams
21
Probability Distributions in Web Workload Models
• Exponential: Session interarrival times
• Pareto:
– Response sizes (tail of distribution)
– Resource sizes (tail of distribution)
– Number of embedded images
– Request interarrival times
• Lognormal:
– Response sizes (body of distribution)
– Resource sizes (body of distribution)
– Temporal locality
• Zipf-like: Resource popularity
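These distributional choices can be sketched directly with the standard library's samplers (all parameter values below are illustrative, not fitted to any trace):

```python
import random

rng = random.Random(42)

def pareto_size(x_min, alpha, rng):
    """Heavy-tailed size: Pareto with scale x_min and shape alpha
    (rng.paretovariate returns values >= 1, so we rescale by x_min)."""
    return x_min * rng.paretovariate(alpha)

# one synthetic workload draw, using illustrative parameters
session_gap = rng.expovariate(1 / 600)       # exponential interarrival, mean 600 s
body_size = rng.lognormvariate(8.5, 1.2)     # lognormal body of size distribution
tail_size = pareto_size(10_000, 1.2, rng)    # Pareto tail, minimum 10 KB

# many Pareto draws: heavy tail pulls the mean well above the median
sizes = [pareto_size(10_000, 1.2, rng) for _ in range(10_000)]
```

The last line illustrates the "wide variation around the mean" claim from the workload slides: for a Pareto with shape near 1, the sample mean is dominated by a few huge draws.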
22
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
23
Wireless Link Management
• Modeling GSM data network layers
– Media access, link, routing, and transport
– Validated ns modeling suite and BONES simulator
– GSM channel error models from Ericsson
• Reliable Link Protocols
– Wireless links have high error rates (> 1%)
– Reliable transport protocols (TCP) interpret errors as congestion
» Need tools to determine multi-layer interaction effects
» Large amounts of data: 120 bytes/s
» Important for design of next-generation networks
– One solution: use a reliable link layer (ARQ) protocol
» However, retransmissions introduce jitter
– Alternative: use error-resilient algorithms to allow apps to handle corrupted data (only protect network protocol headers)
» Less end-to-end delay, constant jitter, higher throughput
24
Testbed, Protocols, Tools
[Diagram: a fixed host and a mobile host (both Unix BSDi 3.0) connected through a GSM BTS, the GSM network, and the PSTN, in transparent or non-transparent mode. Each end runs an H.263+ encoder/decoder over RTP (de)packetization, UDP/UDP Lite, IP, and PPP beneath the socket interface. SocketDUMP and RLPDUMP traces feed MultiTracer, with plotting and analysis in MATLAB.]
25
MultiTracer Time-Sequence Plots
[Time-sequence plot: bytes vs. time of day (sec), showing TcpSnd_data, TcpSnd_ack, TcpRcv_data, and TcpRcv_ack traces around an RLP reset (RlpSnd_rst): 18 segments in flight, 13 segments dropped at the TCP receiver, 5 segments lost due to the RLP reset.]
26
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
27
Applications of Network Distance Mapping
• Mirror Selection
• Cache-infrastructure Configuration
• Service Redirection
• Service Placement
• Overlay Routing/Location
28
Distance Mapping Framework
• Feasible distance metrics
– Number of hops
– Latency
– Bandwidth
• Continuous measurement
• Provide approximate distance information
• Continue to operate in the presence of component changes/failures
• Scale the measurement by self-adaptation
Goal: Develop scalable, robust distance information collection/sharing infrastructure
29
Distance Mapping Challenges
• Select how many probes/monitors to deploy
• Monitor placement
• Choose appropriate monitor for a given client
• Statistically quantify estimation error: e.g., x% of the estimates within a factor of actual distances
• How stable are these clusterings?
30
IDMaps Project
• Internet-wide infrastructure to collect distance information
• IDMaps provides:
– Long-term approximate distances
– Distance estimation between any 2 points on the Internet
• IDMaps does not provide:
– End-to-end application-level performance
– Available bandwidth or current delay
– Characteristics of any specific path
31
IDMaps Components
• Tracers: autonomous instrumentation boxes
• Tracers measure distances between themselves and to APs
• APs (Address Prefixes): regions of the Internet; hosts within an AP are equidistant from the rest of the Internet
[Diagram: hosts in an AP near a tracer; measurement cost is T*T + AP, where T = number of tracers and AP = number of APs.]
Courtesy of IDMaps group
32
IDMaps Architecture
Courtesy of IDMaps group
33
IDMaps Results and Limitations
• Simulation results on synthetic and static network topology
– Cyan: random selection
– Others: various heuristics & algorithms
[Plot: complementary distribution function of the percentage of correct answers.]
Courtesy of IDMaps group
34
IDMaps Limitations
• Based on the triangle inequality
• Considers only number of hops
• Ignores the dynamics of the Internet; no stability study
[Diagram: clients A and B, monitors C and D; is AB = AC + CD + DB?]
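The triangle-inequality composition that IDMaps relies on can be sketched as nearest-tracer estimation (a simplification: real IDMaps aggregates clients into address prefixes first; all distances below are made-up RTTs in ms):

```python
def estimate_distance(client_a, client_b, tracer_dist, client_to_tracer):
    """IDMaps-style estimate: d(A,B) is approximated by
    d(A, Ta) + d(Ta, Tb) + d(Tb, B), where Ta and Tb are each
    client's nearest tracers."""
    ta, da = min(client_to_tracer[client_a].items(), key=lambda kv: kv[1])
    tb, db = min(client_to_tracer[client_b].items(), key=lambda kv: kv[1])
    middle = 0 if ta == tb else tracer_dist[frozenset((ta, tb))]
    return da + middle + db

# toy data: two tracers, two clients (illustrative values)
tracer_dist = {frozenset(("T1", "T2")): 40}
client_to_tracer = {
    "A": {"T1": 5, "T2": 60},
    "B": {"T1": 55, "T2": 8},
}
est = estimate_distance("A", "B", tracer_dist, client_to_tracer)  # 5 + 40 + 8
```

The slide's criticism applies directly: the estimate is only as good as the triangle inequality holds on real Internet paths, which routing policy frequently violates.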
35
Wide-area Network Measurement and Monitoring Services
• Layered Architecture
– Bottom layer: a common core shared across multiple apps, with generic metrics
– More application-specific at the top layer
• Modularity
– Separation of functionality
– Clear definition of interaction between different layers
– Ease of customization and modification
Goal: Understand behavior of the Internet / provide adaptation to Internet apps through monitoring services
36
Layered Architecture
[Diagram, bottom to top: Measurement Layer; Measurement Collection, Transformation and Storage Layer; Federation for Sharing Layer; Dissemination Layer. Decision/design procedures at each level: what to measure and with what tools, probe placement & density, and pull-/push-based APIs on the application side.]
37
Current Focus at Berkeley: Internet “Iso-bar”
• Regions of the network that perceive similar performance to the Internet, i.e., spatial correlation
– How to find them without knowing the topology?
• Used to determine number and placement of monitors; high-dimensional feature space for iso-bar clustering
– Each host collects distance values to m hosts as an m-dimensional feature vector
– Use K-means for high-dimensional clustering
– Choose the site closest to the cluster center as monitor
– Initially m can be the total number of clients; later it may be the number of representative monitoring sites
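The clustering step above can be sketched with plain K-means over distance-feature vectors (a toy sketch: the host vectors and k are invented, and production use would want a robust K-means with restarts):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(pts):
    n = len(pts)
    return tuple(sum(xs) / n for xs in zip(*pts))

def isobar_clusters(vectors, k, iters=20, seed=0):
    """Cluster hosts by their distance-feature vectors (each host's
    measured distances to m landmarks), then pick the host nearest
    each cluster center as that cluster's monitor."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda i: dist2(v, centers[i]))].append(v)
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    monitors = [min(g, key=lambda v: dist2(v, centers[i]))
                for i, g in enumerate(groups) if g]
    return groups, monitors

# toy example: two well-separated "regions" in a 2-landmark feature space
hosts = [(10, 12), (11, 13), (12, 11), (90, 95), (92, 93), (91, 96)]
groups, monitors = isobar_clusters(hosts, k=2)
```

Hosts in the same group perceive similar distances to the landmarks, which is exactly the iso-bar notion of spatial correlation.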
38
Iso-bar Experiments
• Remove the triangle-inequality assumption
• Stationarity: predictability of network properties (temporal correlation)
– Global stationarity: change in the total number of clusters
– Local stationarity: expansion and shrinkage of each cluster
• Experiments with NLANR Active Measurement Project (AMP) data set
– 119 sites in the US and New Zealand
– Traceroute between every pair of hosts every minute
– Use daily average round-trip time (RTT)
– Color the clustered hosts and map them on a US map with longitude and latitude info (imprecise mapping)
39
Geographic Distribution of NLANR AMP Monitoring Sites
40
Underlying Topology of NLANR Sites
Most of the NLANR sites use Abilene Network
41
Preliminary Clustering Results
42
Stationarity of Iso-bar
• Global stationarity quite good
• Local stationarity still under investigation
• Will apply more statistical learning methods, e.g., Gaussian mixture models, kernel methods, for clustering and its dynamics
• Will evaluate its prediction with real measurement data
43
Inferring Internet Topology
Goal: Determine hierarchy among autonomous systems (ASs) based on the types of relationships among them
• Assume two types of relationships
– Provider-Customer
– Peer-Peer
• Providers are above customers in the hierarchy; peers are mostly at the same level
• Inferences
– 5-level hierarchy in the Internet
– Connectivity across levels is strictly non-hierarchical
44
Inferring Internet Topology
• CAIDA & Mercator
– Traceroutes from different locations to get connectivity
– Whois & BGP dumps to find IP address ownership
• Krishnamurthy et al.
– BGP dumps to find IP address ownership
– Use web server logs to cluster IP addresses by behavior
• GT-ITM
– Generated topologies
– Useful for testing specific cases, but not the actual Internet
• Our work
– BGP dumps to find AS connectivity
– BGP dumps to find the number of paths carried by each link
– BGP dumps to find AS preferences for links
45
46
Inferring Type of Relationship
Assumption: ISPs with high probability do not forward BGP advertisements from their peers or providers to other peers or providers
• Implication: If the assumption is completely true, every AS path is "valley-free" (no traversal from peer/provider down to customer and back up to peer/provider)
• Features of inference algorithm
– Collected a large number of BGP dumps; partial views of the Internet from different sources
– Assign every AS a rank based on every dump; apply dominance/clustering rules to find the type of relationships
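Once edge relationships are labeled, the valley-free property can be checked mechanically; a sketch (the 'c2p'/'p2c'/'peer' labels and example paths are illustrative):

```python
def is_valley_free(path, rel):
    """Valley-free check: an AS path may climb customer-to-provider
    links, cross at most one peer link at the top, then only descend
    provider-to-customer. rel maps ordered AS pairs to 'c2p', 'p2c',
    or 'peer'."""
    going_down = False                 # set once we pass the "top" of the path
    for a, b in zip(path, path[1:]):
        r = rel[(a, b)]
        if r == "c2p" and going_down:
            return False               # climbing again after descending: a valley
        if r == "peer":
            if going_down:
                return False           # a peer link must sit at the top
            going_down = True
        if r == "p2c":
            going_down = True
    return True

rel = {("A", "B"): "c2p", ("B", "C"): "peer", ("C", "D"): "p2c",
       ("X", "Y"): "p2c", ("Y", "Z"): "c2p"}
ok = is_valley_free(["A", "B", "C", "D"], rel)   # up, peer, down: valid
bad = is_valley_free(["X", "Y", "Z"], rel)       # down then up: a valley
```

In the inference direction the logic runs backwards: the algorithm chooses relationship labels so that as many observed BGP paths as possible come out valley-free.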
47
Layers in the Internet
• Layer 0 (Strong Core)
– Dense subgraph (peering links) of the Internet topology consisting of only Tier-1 ISPs
• Layer 1 (Transit Core)
– Consists of all top transit providers/large national ISPs
• Layer 2 (Outer Core)
– Last layer where any two ASs have a peering relationship
• Layer 3 (Regional)
– Collection of regional ISPs that support a small customer base
• Layer 4 (Customers)
– Large collection (87%) of ASs that are only customers
48
Our Findings
• Inner core of 20 ASs is highly connected
– 271 edges (full clique = 380)
• Full graph has 10,918 ASs
– 24,598 edges out of 119,191,806 possible edges
• Distribution of paths carried by edges
49
Our Graph of the Core
50
Quantifying the Layering
Layer        | # of ASs |    % | # Intra-Layer Edges | # Inter-Layer Edges
Strong Core  |       20 |  0.2 |                 329 |                9600
Transit Core |      162 |  1.5 |                1052 |                6000
Outer Core   |      674 |  6.3 |                1070 |                3600
Regional     |      950 |  9.2 |                 202 |                2400
Customers    |     8852 | 83.0 |                   0 |                   0
Note: Edges directed from providers to customers; peer-peer links directed both ways
51
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
52
“Trust but Verify”
• Monitoring is integral to SLA verification
• Built on top of the SNMP architecture
– SNMP Agents
– SNMP Manager
– SNMP Protocol (polling/trapping)
– Objects and Management Information Bases (MIBs)
[Diagram: a Manager on a management station speaks SNMP across the network to Agents on managed nodes and managed elements (e.g., an Ethernet interface).]
53
Network Connectivity SLA Monitoring
• Need to monitor availability, traffic (bandwidth, latency) between access routers
• Standard SNMP MIBs
– Current interface status (up/down)
– Time since last status change
– # bytes/packets received/transmitted
– # packets discarded/received in error
– Length of packet queue
• Not really sufficient for determining connectivity SLA!
54
Remote Monitoring of IP Network
• RMON Architecture
– Manager (SNMP Manager), probe points (SNMP Agents)
– Network is a collection of LAN segments; for each, collect:
» Segment statistics (e.g., packet counts)
» Host-specific statistics
» Traffic matrix between hosts on same segment
– Lots of stats can be collected, but difficult to correlate across the LAN segments
– Best for finding bottleneck segments and driving capacity planning
– Not helpful for delay or latency measurements
55
Monitoring Flows
• Flow: correlated subset of network traffic, e.g., with a common source and destination
• Cisco Proprietary NetFlow Architecture
– Flow Collector
– Router to collect the flow information
– Traffic counts on virtual links
• IETF Real-Time Flow Monitoring
– Standardized Flow MIB
56
Network Monitoring with Active Probing
• Ping Program
– Active probing via ICMP echo messages
– Determines loss rates and delays
• Traceroute
– Path and estimated delay that a packet followed in the IP network
– Sends multiple ICMP packets with increasing TTL, discovering routers via ICMP TTL-expired messages
– This can cause high variability in the reported delays
• NTP Sync Messages
– Clock offset, round-trip delay, dispersion info exchange
• Various Statistical Probing Schemes
– Delays and loss rates
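Where raw ICMP sockets are unavailable (ping needs privileges), TCP connection setup time is a common stand-in for RTT; a sketch (it measures SYN/SYN-ACK plus connect overhead rather than ICMP echo; the local listener keeps the example self-contained):

```python
import socket
import statistics
import time

def tcp_rtt_samples(host, port, n=5):
    """Approximate RTT by timing TCP connection setup n times.
    Unlike ping this needs no raw-socket privileges, at the cost of
    including connect overhead in each sample."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        with socket.create_connection((host, port), timeout=3):
            samples.append((time.perf_counter() - t0) * 1e3)  # ms
    return samples

# demo against a throwaway local listener so the sketch is runnable offline
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(5)
rtts = tcp_rtt_samples("127.0.0.1", srv.getsockname()[1], n=3)
srv.close()
med = statistics.median(rtts)
```

Taking the median of several samples, as here, is one of the simple statistical probing schemes the slide alludes to: it damps the variability that single probes exhibit.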
57
SLA Monitoring Issues
• Client- versus operator-side monitoring and reporting
• Monitoring in a multi-class network
• Transport- and application-level monitoring
• Monitoring in an overlay network
• Monitoring in a multi-service-provider environment (finding "the weakest link")
• Accuracy in monitoring
– Number of measurements, frequency of measurements, stability of results, confidence intervals
58
Measurement Points for Verifying SLAs
• Distinguish between measuring within the service provider cloud and end-to-end between customer nodes
59
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
60
From Network Management to Service Management
[Diagram: progression from Server Load Balancing through Advanced Traffic Management to Service Level Control.]
• Server and site availability
• Balanced server and site load
• Rapid change
• Network and application flexibility
• Scalability
• Complex site administration
• Rapid problem diagnosis/isolation
• Service level measurement
• Multi-tier resource monitoring
• Preferential services
• Resource provisioning
• Self-tuning
• Problem prevention
Morino, Resonate
61
Service Reliability is Critical
Causes of failure (Source: IDC; Morino, Resonate):
• Applications failure: 28.5% (process hung, slowed database performance)
• Systems: OS failure 24.6%, server failure 20% (CPU overloaded, NIC failure)
• Network failure: 18.2% (ISP connection down, LAN segment overloaded)
• Administration: 8.7%
62
Traditional Traffic Management
• Single-tier, single-site service level control
– Higher service levels
– Better resource utilization
– Multiple features to meet unique needs
[Diagram: user → Internet → traffic management → content servers.]
Morino, Resonate
63
Basic LAN Solution Requirements
• Simple load balancing
– Establish Virtual IP address (VIP)
– Delivers scalable performance
• Health checks and service monitoring
– Look beyond layer 3/4 characteristics
– Returned content, response times, etc.
– Better information to determine server status
– Use traffic management techniques to insulate users from effects of server or software failure
Morino, Resonate
64
Advanced LAN Solution Requirements
• Complex traffic management
– More intelligent policies for application state management
– Enforce sophisticated user-based policies
– Inspection of application header
» URL parsing: direct requests to systems with available content
• Functional segregation of Web site
» SSL Session IDs: requirement to maintain persistence
• Maintains application state
• Multiple TCP sessions within a single SSL session
» Cookies: more precise user identification and classification
• Look through proxies, firewalls
• Establish preferential services
– Integration with WAN solutions
Advanced traffic management features require delayed binding
Morino, Resonate
65
Delayed Binding Connection
[Diagram: the client (browser) and server (HTTPd) exchange SYN and SYN/ACK. With immediate binding, the connection is bound to a server at the SYN; with delayed binding, the bind happens only after the PUSH (HTTP GET) from the client, followed by data from the server.]
Morino, Resonate
66
Delayed Binding Issues
• Push packet contains URL, cookie, all application information (except port number)
• Must read application header to deliver advanced traffic management features
• Delayed binding is the only way to see the application header before the decision is made to 'bind' to a server
Morino, Resonate
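The bind decision itself can be sketched as: read the request line the PUSH packet carries, then route by URL prefix (the route table and backend names below are invented for illustration):

```python
def choose_backend(request_bytes, routes, default):
    """Delayed-binding decision sketch: the balancer completes the TCP
    handshake with the client itself, reads the HTTP request, and only
    then picks a backend by URL prefix."""
    request_line = request_bytes.split(b"\r\n", 1)[0]
    try:
        method, url, _version = request_line.split(b" ")
    except ValueError:
        return default        # malformed request line: fall back
    for prefix, backend in routes.items():
        if url.startswith(prefix):
            return backend
    return default

# hypothetical site split: images on one farm, CGI on an app tier
routes = {b"/images/": "img-farm", b"/cgi/": "app-tier"}
backend = choose_backend(
    b"GET /images/logo.gif HTTP/1.0\r\nHost: x\r\n\r\n",
    routes, "default-pool",
)
```

The next slide's warning applies here: this parsing runs on every connection, so the per-request CPU cost is what makes delayed binding expensive.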
67
Be Careful What You Wish For...
• Now that you have the header, what do you do with it?
• Unstructured format, application-specific, might be encrypted
• CPU sink hole!
• Be sure to watch what happens to throughput when you turn on delayed binding features
Morino, Resonate
68
Deeper Visibility for Managing Complex Infrastructures
Multi-tier service level control:
• Instrument back-end systems
• Capture health and status
• Diagnose and isolate problems
• Take corrective action immediately!
[Diagram: user → Internet → traffic management → content servers → app servers → data layer, with server-side instrumentation feeding systems management.]
Morino, Resonate
69
Redundant Site Implementation: Growth and Failover
• Multi-site service level control
• Higher service levels
• Better resource utilization
• Not a networking solution
• Not a performance issue
– POP persistence dominates issues
[Diagram: user → Internet → WAN traffic management → sites in SF, NY, and SF.]
Morino, Resonate
70
Management and Administration is Crucial
Consolidated view of multiple sites:
• Eases management of complex e-businesses
• Reduces costs associated with undetected problems
[Diagram: a sysadmin in Denver, CO uses an Enterprise Services Console spanning sites in SF, NY, and SF.]
Morino, Resonate
71
Closed-loop Real-Time Control of IP-based Applications
[Diagram: Intelligent Service Management applies policy-based control to IP-application traffic management functions and systems management functions, with feedback from each closing the loop.]
Morino, Resonate
72
Resonate Case Study
• Central Dispatch
– Software-based load balancer for servers on a LAN
– Sophisticated policy-driven filtering, redirection, load balancing
– Class-of-service support for server access
• Global Dispatch
– Multi-site management, wide-area redirection, disaster recovery
– Advanced traffic mapping capabilities:
» Sticky/persistent session support and sticky session failover
» Directed Traffic Table directs users to predefined POP
– Configurable scheduling based on WAN latency and site load
– POP failover handling
– Advanced stats: avg. DNS response, POP hit rate, other QoS
– Coexistence with existing DNS and load balancing architecture
– Pass multiple IP addresses to client for browser-based failover
– Weighted round-robin scheduling
http://www.resonate.com
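The weighted round-robin scheduling listed above can be sketched by expanding weights into a cyclic schedule (server names and weights are illustrative; smoother variants interleave picks more evenly):

```python
import itertools

def weighted_round_robin(weights):
    """Yield server names in proportion to integer weights: each cycle
    of the schedule contains a server once per unit of weight."""
    schedule = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(schedule)

# hypothetical sites: NY gets 3x the traffic of SF
rr = weighted_round_robin({"ny": 3, "sf": 1})
picks = [next(rr) for _ in range(8)]
```

Over any full cycle the traffic split matches the weights exactly, which is all the wide-area redirection described here needs.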
73
Resonate Case Study
• Commander
– End-to-end monitoring
» URL tests, host access tests, HTTP service availability tests
» SNMP traps
– Test, statistics, and control features
» Gather availability info: site + Web/app/DB servers
» Process events (inaccessible file servers, DB, network congestion, etc.) for reporting/initiation of user-defined actions
– Features:
» Rapid identification and resolution of site problems
» Multi-tier resource monitoring of site servers
» Identify problems before service levels are affected
» Identify network trends essential to optimized site planning
» User-defined service mgmt policies for automated control
http://www.resonate.com
74
Resonate Case Study
• Automated Control for Policy-Based Problem Resolution
– Sophisticated server-level control policies
– Monitors events and processes them according to pre-defined rules & actions
– E.g., sending email/electronic pages, script invocation
• Examples of policy-based control include:
– Schedule traffic away from a Web server w/ slow/failed backend app server
– Increase/decrease traffic to server when perf crosses thresholds
– Enable backup content server in a Central Dispatch site when one or more active content servers fail/become too busy
– Monitor apps and server processes; restart any that fail
http://www.resonate.com