1
Berkeley-Helsinki Summer Course
Lecture #7: Network Measurement and
MonitoringRandy H. Katz
Computer Science Division
Electrical Engineering and Computer Science Department
University of California
Berkeley, CA 94720-1776
2
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
3
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
4
Measuring/Characterizing Web Traffic
• Motivation for Measurement
– Insights into Web site design
– Managing proxies and servers
– Operating IP networks
• Measurement Process
– Monitoring from some network location
– Generate measurement records in some format
– Preprocessing for subsequent analysis
• Based on Chapter 9, “Web Traffic Measurement,” in Web Protocols and Practice, Krishnamurthy and Rexford, Addison Wesley, Reading, MA, 2001.
5
Web Measurement
• Content Creators
– Measurements of user browsing patterns
» Number of visitors, site stickiness influence advertising revenue
» Optimize for common user sequences
» User-perceived latency influences server and placement decisions
• Web Hosting Company
– Number of response messages/bytes served influences load balancing strategy among multiple hosted sites
» Mix of busy-day sites/busy-night sites
» Managing persistent connections
– Resource usage influences billing
» When to introduce more servers, better connectivity
6
Web Measurement
• Network Operators
– Resource decisions: where to add bandwidth, when to upgrade links, where to place proxies and caches, how to modify routing within the provider cloud, etc.
– User community: relative mix of clients with low vs. high bandwidth connectivity
• Web/Networking Researchers
– Evaluating performance of protocols and software
– Drive evolution of protocols, policies, algorithms
– Better understanding of Internet traffic dynamics
7
Measurement Techniques
• Server Logging
– Log entry per HTTP request
– Requesting client
» Could be a user, a proxy, or a cache (the latter two represent aggregated patterns)
» Identified by an IP address
• Could represent the workload of multiple users
• Dynamically assigned addresses not correlated with same user each time encountered
– Request time
– Request/response messages
– Coarse-grained, aggregated times
– NOTE: proxy/cache-satisfied requests are filtered before reaching the server
– Hard to obtain!
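The per-request fields above are what a typical access-log line carries; a minimal parsing sketch in Python (the Common Log Format regex and the sample line are illustrative, not any particular server's exact output):

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_entry(line):
    """Return a dict of log fields, or None if the line doesn't match."""
    m = CLF.match(line)
    if not m:
        return None
    d = m.groupdict()
    d["bytes"] = 0 if d["bytes"] == "-" else int(d["bytes"])
    d["status"] = int(d["status"])
    return d

entry = parse_entry(
    '10.0.0.1 - - [10/Jul/2001:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 4523'
)
```

Note that, as the slide says, the timestamp is coarse-grained (one-second resolution) and the client host may hide many users behind a proxy.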
8
Measurement Techniques
• Proxy Logging
– Proxies can be associated with clients or servers, e.g., proxy for UC Berkeley vs. proxy for Google
– Former provides insights into client behavior aggregated by administrative domain; more detailed information about individual clients may be available
– Degree of aggregation depends on how close the proxy is to clients (close implies small community, far implies large community)
– Limited scope, accesses filtered by browser caches
– Hard to obtain!
9
Measurement Techniques
• Packet Monitoring
– Network-level logging (HTTP, IP, TCP)
– Fine-grained timestamping possible
– Some requests satisfied from client caches; encrypted packets could present collection difficulties
– Monitor needs to be placed so as to be able to eavesdrop on packets
10
Measurement Techniques
• Active Measurement
– Generate requests in a controlled manner, observe their performance
– Issues:
» Where to locate the modified user agents: geographical placement, quality of connectivity to wide-area network
» What requests to generate: e.g., based on profile of popular web sites
» What measurements to collect: DNS queries, TCP timeouts; proxy interception makes it difficult to distinguish sources of latency
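One way to separate the latency sources mentioned above is to time each step of a probe individually; a hedged sketch (the throwaway local listener only makes the example self-contained):

```python
import socket
import time

def probe(host, port):
    """Time DNS resolution and TCP connect separately, so the two
    latency sources (DNS queries vs. TCP setup) can be told apart."""
    t0 = time.perf_counter()
    addr = socket.gethostbyname(host)            # DNS lookup
    t1 = time.perf_counter()
    with socket.create_connection((addr, port), timeout=5):
        t2 = time.perf_counter()                 # TCP handshake done
    return {"dns_ms": (t1 - t0) * 1e3, "connect_ms": (t2 - t1) * 1e3}

# demo against a throwaway local listener so the sketch is runnable offline
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
sample = probe("localhost", srv.getsockname()[1])
srv.close()
```

A real measurement agent would add the HTTP transfer time as a third component and repeat probes to get stable statistics.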
11
Inferences from Measurement Data
• Limitation of HTTP Header Information
– Incomplete header logging
– Heuristics needed to reconstruct behavior from log
• Ambiguous Client/Server Identity
– Client identity/unique IP address
– Many IP addresses associated with same server
• Inferring User Actions
– Difficult to correlate user-level actions like mouse clicks with observed network activity
– One click can generate many HTTP requests
• Detecting Resource Modifications
– Web-level actions typically miss modifications
– Incomplete use of Last-Modified and Date fields by servers
12
Web Workload Characterization
• Applications of Workload Models
– Identifying performance problems
» High latency/low throughput under specific load scenarios
– Benchmarking Web components
» Selecting among competing architectures
– Capacity planning
» "Right sizing" network bandwidth, CPU, disk, memory given expected loads
• Workload Parameters
– Protocols: Request method/Response code
– Resources: Content type, Resource size, Response size, Popularity, Modification frequency, Temporal locality, Number of embedded resources
– Users: Session interarrival times, Number of clicks per session, Request interarrival times
13
Workload Characteristics
• HTTP Requests/Responses
– GET method predominates, small number of POSTs (forms), OK responses
– More intelligent protocols for communicating with caches may change distribution of requests (e.g., HEAD)
• Web Resources
– Text and images dominate, increasing audio/video content
– Small resource size dominates; average HTML file size is 4-8 KB, image file size 14 KB; wide variation around the mean implies Pareto distribution ("heavy tailed")
– Higher b/w connections imply larger web objects over time
• Response Sizes
– Users likely to abort large transfers, so median response size smaller than median resource size; very heavy tail
– Effect of higher b/w connections on response size?
14
Workload Characteristics
• Resource Popularity
– Zipf's Law: a small number of objects are highly popular
– Effectiveness of caching at all levels (client browser cache, site proxy cache, even DNS name cache)
• Resource Changes
– Static content vs. script-based descriptions
– Periodic changes ("young die young")
• Temporal Locality
– Correlated access to resources in time
• Embedded Resources
– Web pages have median of 8-20 embedded resources, heavy-tailed distribution
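Zipf-skewed popularity is exactly why small caches work well; a toy sketch, assuming a 1/k^s popularity law and a cache pinned to the 50 most popular of 1000 objects (all parameter values are illustrative):

```python
import random

def zipf_sample(n_objects, s, rng):
    """Draw one object index with P(rank k) proportional to 1/k^s."""
    weights = [1.0 / (k ** s) for k in range(1, n_objects + 1)]
    return rng.choices(range(n_objects), weights=weights, k=1)[0]

rng = random.Random(7)
requests = [zipf_sample(1000, 1.0, rng) for _ in range(5000)]

# Hit rate of a cache holding only the 50 most popular of 1000 objects:
# Zipf skew means a tiny cache captures a large share of the requests.
cache = set(range(50))
hits = sum(1 for r in requests if r in cache)
hit_rate = hits / len(requests)
```

With s = 1, the expected hit rate is H(50)/H(1000), roughly 60%, even though the cache holds only 5% of the objects.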
15
Workload Characteristics
• User Behavior
– Session and request arrivals
» Infer session via repeated access to same server
» Burst of HTTP requests, think time
– Clicks per session
» 4-10 clicks on average; distinguish between "sticky" sites and directory/redirection sites
» Heavy user vs. light user
– Request interarrival times
» Activity punctuated with think times
» Request interarrivals on the order of 60 seconds
16
Research Perspectives on Measurement
• Packet monitoring of HTTP traffic
• Analyzing Web server logs
• Publicly available logs and traces
• Measuring multimedia streams
17
Packet Monitoring of HTTP Traffic
• Tapping a link carrying IP packets
• Capturing packets from HTTP transfers
• Demuxing packets into TCP connections
• Reconstructing ordered stream of bytes
• Extracting HTTP messages from byte stream
• Generating a log of HTTP messages
18
Analyzing Web Server Logs
• Parsing and Filtering
– Logs in multiple formats
– Interleaved log records
– Timestamp diversity
• Transforming
– Remove erroneous records
– Diverse formats for URLs; conversion to unique integers for easier processing
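The URL-to-integer transformation above can be sketched as a canonicalize-then-intern table (the canonicalization rules here, lowercasing the host and keeping the path, are illustrative; a real pipeline would also handle default ports, %-escapes, etc.):

```python
from urllib.parse import urlsplit

class UrlTable:
    """Map canonicalized URLs to small integer IDs, so later analysis
    can work on integers instead of diversely formatted URL strings."""
    def __init__(self):
        self.ids = {}

    def canonical(self, url):
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        path = parts.path or "/"
        return host + path

    def to_id(self, url):
        key = self.canonical(url)
        # assign the next integer the first time a canonical URL is seen
        return self.ids.setdefault(key, len(self.ids))

t = UrlTable()
a = t.to_id("http://WWW.Example.com/index.html")
b = t.to_id("http://www.example.com/index.html")   # same ID as a
```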
19
Publicly Available Logs and Traces
• Internet Traffic Archive– http://www.acm.org/sigcomm/ita
• World Wide Web Consortium’s Web Characterization Group Repository
– http://www.purl.org/net/repository
• NLANR– http://ircache.nlanr.net/Cache/
• CAnet Squid logs– http://ardnoc41.canet2.net/cache/
20
Measuring Multimedia Streams
• Static analysis of multimedia resources
– Locating video content at various web sites
– Acquiring copies
– Computing statistics
• Multimedia server logs
– VCR-like operations
– User access patterns, frequency of early abort
• Packet monitoring of multimedia streams
– Infer session identity from src/dst IP address, port #, protocol
• Multilayer packet monitoring
– Correlation of control and data streams
21
Probability Distributions in Web Workload Models
• Exponential: Session interarrival times
• Pareto:
– Response sizes (tail of distribution)
– Resource sizes (tail of distribution)
– Number of embedded images
– Request interarrival times
• Lognormal:
– Response sizes (body of distribution)
– Resource sizes (body of distribution)
– Temporal locality
• Zipf-like: Resource popularity
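These distributional choices can be sketched directly with the standard library's samplers (all parameter values below are illustrative, not fitted to any trace):

```python
import random

rng = random.Random(42)

def pareto_size(x_min, alpha, rng):
    """Heavy-tailed size: Pareto with scale x_min and shape alpha
    (rng.paretovariate returns values >= 1, so we rescale by x_min)."""
    return x_min * rng.paretovariate(alpha)

# one synthetic workload draw, using illustrative parameters
session_gap = rng.expovariate(1 / 600)       # exponential interarrival, mean 600 s
body_size = rng.lognormvariate(8.5, 1.2)     # lognormal body of size distribution
tail_size = pareto_size(10_000, 1.2, rng)    # Pareto tail, minimum 10 KB

# many Pareto draws: heavy tail pulls the mean well above the median
sizes = [pareto_size(10_000, 1.2, rng) for _ in range(10_000)]
```

The last line illustrates the "wide variation around the mean" claim from the workload slides: for a Pareto with shape near 1, the sample mean is dominated by a few huge draws.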
22
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
23
Wireless Link Management
• Modeling GSM data network layers
– Media access, link, routing, and transport
– Validated ns modeling suite and BONES simulator
– GSM channel error models from Ericsson
• Reliable Link Protocols
– Wireless links have high error rates (> 1%)
– Reliable transport protocols (TCP) interpret errors as congestion
» Need tools to determine multi-layer interaction effects
» Large amounts of data: 120 bytes/s
» Important for design of next-generation networks
– One solution: use a reliable link layer (ARQ) protocol
» However, retransmissions introduce jitter
– Alternative: use error-resilient algorithms to allow apps to handle corrupted data (only protect network protocol headers)
» Less end-to-end delay, constant jitter, higher throughput
24
Testbed, Protocols, Tools
[Diagram: a fixed host and a mobile host (both Unix BSDi 3.0) connected through a GSM BTS, the GSM network, and the PSTN, in transparent or non-transparent mode. Each end runs an H.263+ encoder/decoder over RTP (de)packetization, UDP/UDP Lite, IP, and PPP beneath the socket interface. SocketDUMP and RLPDUMP traces feed MultiTracer, with plotting and analysis in MATLAB.]
25
MultiTracer Time-Sequence Plots
[Time-sequence plot: bytes vs. time of day (sec), showing TcpSnd_data, TcpSnd_ack, TcpRcv_data, and TcpRcv_ack traces around an RLP reset (RlpSnd_rst): 18 segments in flight, 13 segments dropped at the TCP receiver, 5 segments lost due to the RLP reset.]
26
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
27
Applications of Network Distance Mapping
• Mirror Selection
• Cache-infrastructure Configuration
• Service Redirection
• Service Placement
• Overlay Routing/Location
28
Distance Mapping Framework
• Feasible distance metrics
– Number of hops
– Latency
– Bandwidth
• Continuous measurement
• Provide approximate distance information
• Continue to operate in the presence of component changes/failures
• Scale the measurement by self-adaptation
Goal: Develop scalable, robust distance information collection/sharing infrastructure
29
Distance Mapping Challenges
• Select how many probes/monitors to deploy
• Monitor placement
• Choose appropriate monitor for a given client
• Statistically quantify estimation error: e.g., x% of the estimates within a factor of actual distances
• How stable are these clusterings?
30
IDMaps Project
• Internet-wide infrastructure to collect distance information
• IDMaps provides:
– Long-term approximate distances
– Distance estimation between any 2 points on the Internet
• IDMaps does not provide:
– End-to-end application-level performance
– Available bandwidth or current delay
– Characteristics of any specific path
31
IDMaps Components
• Tracers: autonomous instrumentation boxes
• Tracers measure distances between themselves and to APs
• APs (Address Prefixes): regions of the Internet; hosts within an AP are equidistant from the rest of the Internet
[Diagram: hosts in an AP near a tracer; measurement cost is T*T + AP, where T = number of tracers and AP = number of APs.]
Courtesy of IDMaps group
32
IDMaps Architecture
Courtesy of IDMaps group
33
IDMaps Results and Limitations
• Simulation results on synthetic and static network topology
– Cyan: random selection
– Others: various heuristics & algorithms
[Plot: complementary distribution function of the percentage of correct answers.]
Courtesy of IDMaps group
34
IDMaps Limitations
• Based on the triangle inequality
• Considers only number of hops
• Ignores the dynamics of the Internet; no stability study
[Diagram: clients A and B, monitors C and D; is AB = AC + CD + DB?]
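The triangle-inequality composition that IDMaps relies on can be sketched as nearest-tracer estimation (a simplification: real IDMaps aggregates clients into address prefixes first; all distances below are made-up RTTs in ms):

```python
def estimate_distance(client_a, client_b, tracer_dist, client_to_tracer):
    """IDMaps-style estimate: d(A,B) is approximated by
    d(A, Ta) + d(Ta, Tb) + d(Tb, B), where Ta and Tb are each
    client's nearest tracers."""
    ta, da = min(client_to_tracer[client_a].items(), key=lambda kv: kv[1])
    tb, db = min(client_to_tracer[client_b].items(), key=lambda kv: kv[1])
    middle = 0 if ta == tb else tracer_dist[frozenset((ta, tb))]
    return da + middle + db

# toy data: two tracers, two clients (illustrative values)
tracer_dist = {frozenset(("T1", "T2")): 40}
client_to_tracer = {
    "A": {"T1": 5, "T2": 60},
    "B": {"T1": 55, "T2": 8},
}
est = estimate_distance("A", "B", tracer_dist, client_to_tracer)  # 5 + 40 + 8
```

The slide's criticism applies directly: the estimate is only as good as the triangle inequality holds on real Internet paths, which routing policy frequently violates.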
35
Wide-area Network Measurement and Monitoring Services
• Layered Architecture
– Bottom layer: a common core shared across multiple apps, with generic metrics
– More application-specific at the top layer
• Modularity
– Separation of functionality
– Clear definition of interaction between different layers
– Ease of customization and modification
Goal: Understand behavior of the Internet / provide adaptation to Internet apps through monitoring services
36
Layered Architecture
[Diagram, bottom to top: Measurement Layer; Measurement Collection, Transformation and Storage Layer; Federation for Sharing Layer; Dissemination Layer. Decision/design procedures at each level: what to measure and with what tools, probe placement & density, and pull-/push-based APIs on the application side.]
37
Current Focus at Berkeley: Internet “Iso-bar”
• Regions of the network that perceive similar performance to the Internet, i.e., spatial correlation
– How to find them without knowing the topology?
• Used to determine number and placement of monitors; high-dimensional feature space for iso-bar clustering
– Each host collects distance values to m hosts as an m-dimensional feature vector
– Use K-means for high-dimensional clustering
– Choose the site closest to the cluster center as monitor
– Initially m can be the total number of clients; later it may be the number of representative monitoring sites
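The clustering step above can be sketched with plain K-means over distance-feature vectors (a toy sketch: the host vectors and k are invented, and production use would want a robust K-means with restarts):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(pts):
    n = len(pts)
    return tuple(sum(xs) / n for xs in zip(*pts))

def isobar_clusters(vectors, k, iters=20, seed=0):
    """Cluster hosts by their distance-feature vectors (each host's
    measured distances to m landmarks), then pick the host nearest
    each cluster center as that cluster's monitor."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda i: dist2(v, centers[i]))].append(v)
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    monitors = [min(g, key=lambda v: dist2(v, centers[i]))
                for i, g in enumerate(groups) if g]
    return groups, monitors

# toy example: two well-separated "regions" in a 2-landmark feature space
hosts = [(10, 12), (11, 13), (12, 11), (90, 95), (92, 93), (91, 96)]
groups, monitors = isobar_clusters(hosts, k=2)
```

Hosts in the same group perceive similar distances to the landmarks, which is exactly the iso-bar notion of spatial correlation.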
38
Iso-bar Experiments
• Remove the triangle-inequality assumption
• Stationarity: predictability of network properties (temporal correlation)
– Global stationarity: change in the total number of clusters
– Local stationarity: expansion and shrinkage of each cluster
• Experiments with NLANR Active Measurement Project (AMP) data set
– 119 sites in the US and New Zealand
– Traceroute between every pair of hosts every minute
– Use daily average round-trip time (RTT)
– Color the clustered hosts and map them on a US map with longitude and latitude info (imprecise mapping)
39
Geographic Distribution of NLANR AMP Monitoring Sites
40
Underlying Topology of NLANR Sites
Most of the NLANR sites use Abilene Network
41
Preliminary Clustering Results
42
Stationarity of Iso-bar
• Global stationarity quite good
• Local stationarity still under investigation
• Will apply more statistical learning methods, e.g., Gaussian mixture models, kernel methods, for clustering and its dynamics
• Will evaluate its prediction with real measurement data
43
Inferring Internet Topology
Goal: Determine hierarchy among autonomous systems (ASs) based on the types of relationships among them
• Assume two types of relationships
– Provider-Customer
– Peer-Peer
• Providers are above customers in the hierarchy; peers are mostly at the same level
• Inferences
– 5-level hierarchy in the Internet
– Connectivity across levels is strictly non-hierarchical
44
Inferring Internet Topology
• CAIDA & Mercator
– Traceroutes from different locations to get connectivity
– Whois & BGP dumps to find IP address ownership
• Krishnamurthy et al.
– BGP dumps to find IP address ownership
– Use web server logs to cluster IP addresses by behavior
• GT-ITM
– Generated topologies
– Useful for testing specific cases, but not the actual Internet
• Our work
– BGP dumps to find AS connectivity
– BGP dumps to find the number of paths carried by each link
– BGP dumps to find AS preferences for links
45
46
Inferring Type of Relationship
Assumption: ISPs with high probability do not forward BGP advertisements from their peers or providers to other peers or providers
• Implication: If the assumption is completely true, every AS path is "valley-free" (no traversal from peer/provider down to customer and back up to peer/provider)
• Features of inference algorithm
– Collected a large number of BGP dumps; partial views of the Internet from different sources
– Assign every AS a rank based on every dump; apply dominance/clustering rules to find the type of relationships
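Once edge relationships are labeled, the valley-free property can be checked mechanically; a sketch (the 'c2p'/'p2c'/'peer' labels and example paths are illustrative):

```python
def is_valley_free(path, rel):
    """Valley-free check: an AS path may climb customer-to-provider
    links, cross at most one peer link at the top, then only descend
    provider-to-customer. rel maps ordered AS pairs to 'c2p', 'p2c',
    or 'peer'."""
    going_down = False                 # set once we pass the "top" of the path
    for a, b in zip(path, path[1:]):
        r = rel[(a, b)]
        if r == "c2p" and going_down:
            return False               # climbing again after descending: a valley
        if r == "peer":
            if going_down:
                return False           # a peer link must sit at the top
            going_down = True
        if r == "p2c":
            going_down = True
    return True

rel = {("A", "B"): "c2p", ("B", "C"): "peer", ("C", "D"): "p2c",
       ("X", "Y"): "p2c", ("Y", "Z"): "c2p"}
ok = is_valley_free(["A", "B", "C", "D"], rel)   # up, peer, down: valid
bad = is_valley_free(["X", "Y", "Z"], rel)       # down then up: a valley
```

In the inference direction the logic runs backwards: the algorithm chooses relationship labels so that as many observed BGP paths as possible come out valley-free.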
47
Layers in the Internet
• Layer 0 (Strong Core)
– Dense subgraph (peering links) of the Internet topology consisting of only Tier-1 ISPs
• Layer 1 (Transit Core)
– Consists of all top transit providers/large national ISPs
• Layer 2 (Outer Core)
– Last layer where any two ASs have a peering relationship
• Layer 3 (Regional)
– Collection of regional ISPs that support a small customer base
• Layer 4 (Customers)
– Large collection (87%) of ASs that are only customers
48
Our Findings
• Inner core of 20 ASs is highly connected
– 271 edges (full clique = 380)
• Full graph has 10,918 ASs
– 24,598 edges out of 119,191,806 possible edges
• Distribution of paths carried by edges
49
Our Graph of the Core
50
Quantifying the Layering
Layer        | # of ASs |    % | # Intra-Layer Edges | # Inter-Layer Edges
Strong Core  |       20 |  0.2 |                 329 |                9600
Transit Core |      162 |  1.5 |                1052 |                6000
Outer Core   |      674 |  6.3 |                1070 |                3600
Regional     |      950 |  9.2 |                 202 |                2400
Customers    |     8852 | 83.0 |                   0 |                   0
Note: Edges directed from providers to customers; peer-peer links directed both ways
51
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
52
“Trust but Verify”
• Monitoring is integral to SLA verification
• Built on top of the SNMP architecture
– SNMP Agents
– SNMP Manager
– SNMP Protocol (polling/trapping)
– Objects and Management Information Bases (MIBs)
[Diagram: a Manager on a management station speaks SNMP across the network to Agents on managed nodes and managed elements (e.g., an Ethernet interface).]
53
Network Connectivity SLA Monitoring
• Need to monitor availability, traffic (bandwidth, latency) between access routers
• Standard SNMP MIBs
– Current interface status (up/down)
– Time since last status change
– # bytes/packets received/transmitted
– # packets discarded/received in error
– Length of packet queue
• Not really sufficient for determining connectivity SLA!
54
Remote Monitoring of IP Network
• RMON Architecture
– Manager (SNMP Manager), probe points (SNMP Agents)
– Network is a collection of LAN segments; for each, collect:
» Segment statistics (e.g., packet counts)
» Host-specific statistics
» Traffic matrix between hosts on same segment
– Lots of stats can be collected, but difficult to correlate across the LAN segments
– Best for finding bottleneck segments and driving capacity planning
– Not helpful for delay or latency measurements
55
Monitoring Flows
• Flow: correlated subset of network traffic, e.g., with a common source and destination
• Cisco Proprietary NetFlow Architecture
– Flow Collector
– Router to collect the flow information
– Traffic counts on virtual links
• IETF Real-Time Flow Monitoring
– Standardized Flow MIB
56
Network Monitoring with Active Probing
• Ping Program
– Active probing via ICMP echo messages
– Determines loss rates and delays
• Traceroute
– Path and estimated delay that a packet followed in the IP network
– Sends multiple ICMP packets with increasing TTL, discovering routers via ICMP TTL-expired messages
– This can cause high variability in the reported delays
• NTP Sync Messages
– Clock offset, round-trip delay, dispersion info exchange
• Various Statistical Probing Schemes
– Delays and loss rates
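Where raw ICMP sockets are unavailable (ping needs privileges), TCP connection setup time is a common stand-in for RTT; a sketch (it measures SYN/SYN-ACK plus connect overhead rather than ICMP echo; the local listener keeps the example self-contained):

```python
import socket
import statistics
import time

def tcp_rtt_samples(host, port, n=5):
    """Approximate RTT by timing TCP connection setup n times.
    Unlike ping this needs no raw-socket privileges, at the cost of
    including connect overhead in each sample."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        with socket.create_connection((host, port), timeout=3):
            samples.append((time.perf_counter() - t0) * 1e3)  # ms
    return samples

# demo against a throwaway local listener so the sketch is runnable offline
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(5)
rtts = tcp_rtt_samples("127.0.0.1", srv.getsockname()[1], n=3)
srv.close()
med = statistics.median(rtts)
```

Taking the median of several samples, as here, is one of the simple statistical probing schemes the slide alludes to: it damps the variability that single probes exhibit.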
57
SLA Monitoring Issues
• Client- versus operator-side monitoring and reporting
• Monitoring in a multi-class network
• Transport- and application-level monitoring
• Monitoring in an overlay network
• Monitoring in a multi-service-provider environment (finding "the weakest link")
• Accuracy in monitoring
– Number of measurements, frequency of measurements, stability of results, confidence intervals
58
Measurement Points for Verifying SLAs
• Distinguish between measuring within the service provider cloud and end-to-end between customer nodes
59
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
60
From Network Management to Service Management
[Diagram: progression from Server Load Balancing through Advanced Traffic Management to Service Level Control.]
• Server and site availability
• Balanced server and site load
• Rapid change
• Network and application flexibility
• Scalability
• Complex site administration
• Rapid problem diagnosis/isolation
• Service level measurement
• Multi-tier resource monitoring
• Preferential services
• Resource provisioning
• Self-tuning
• Problem prevention
Morino, Resonate
61
Service Reliability is Critical
Causes of failure (Source: IDC; Morino, Resonate):
• Applications failure: 28.5% (process hung, slowed database performance)
• Systems: OS failure 24.6%, server failure 20% (CPU overloaded, NIC failure)
• Network failure: 18.2% (ISP connection down, LAN segment overloaded)
• Administration: 8.7%
62
Traditional Traffic Management
• Single-tier, single-site service level control
– Higher service levels
– Better resource utilization
– Multiple features to meet unique needs
[Diagram: user → Internet → traffic management → content servers.]
Morino, Resonate
63
Basic LAN Solution Requirements
• Simple load balancing
– Establish Virtual IP address (VIP)
– Delivers scalable performance
• Health checks and service monitoring
– Look beyond layer 3/4 characteristics
– Returned content, response times, etc.
– Better information to determine server status
– Use traffic management techniques to insulate users from effects of server or software failure
Morino, Resonate
64
Advanced LAN Solution Requirements
• Complex traffic management
– More intelligent policies for application state management
– Enforce sophisticated user-based policies
– Inspection of application header
» URL parsing: direct requests to systems with available content
• Functional segregation of Web site
» SSL Session IDs: requirement to maintain persistence
• Maintains application state
• Multiple TCP sessions within a single SSL session
» Cookies: more precise user identification and classification
• Look through proxies, firewalls
• Establish preferential services
– Integration with WAN solutions
Advanced traffic management features require delayed binding
Morino, Resonate
65
Delayed Binding Connection
[Diagram: the client (browser) and server (HTTPd) exchange SYN and SYN/ACK. With immediate binding, the connection is bound to a server at the SYN; with delayed binding, the bind happens only after the PUSH (HTTP GET) from the client, followed by data from the server.]
Morino, Resonate
66
Delayed Binding Issues
• Push packet contains URL, cookie, all application information (except port number)
• Must read application header to deliver advanced traffic management features
• Delayed binding is the only way to see the application header before the decision is made to 'bind' to a server
Morino, Resonate
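The bind decision itself can be sketched as: read the request line the PUSH packet carries, then route by URL prefix (the route table and backend names below are invented for illustration):

```python
def choose_backend(request_bytes, routes, default):
    """Delayed-binding decision sketch: the balancer completes the TCP
    handshake with the client itself, reads the HTTP request, and only
    then picks a backend by URL prefix."""
    request_line = request_bytes.split(b"\r\n", 1)[0]
    try:
        method, url, _version = request_line.split(b" ")
    except ValueError:
        return default        # malformed request line: fall back
    for prefix, backend in routes.items():
        if url.startswith(prefix):
            return backend
    return default

# hypothetical site split: images on one farm, CGI on an app tier
routes = {b"/images/": "img-farm", b"/cgi/": "app-tier"}
backend = choose_backend(
    b"GET /images/logo.gif HTTP/1.0\r\nHost: x\r\n\r\n",
    routes, "default-pool",
)
```

The next slide's warning applies here: this parsing runs on every connection, so the per-request CPU cost is what makes delayed binding expensive.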
67
Be Careful What You Wish For...
• Now that you have the header, what do you do with it?
• Unstructured format, application-specific, might be encrypted
• CPU sink hole!
• Be sure to watch what happens to throughput when you turn on delayed binding features
Morino, Resonate
68
Deeper Visibility for Managing Complex Infrastructures
Multi-tier service level control:
• Instrument back-end systems
• Capture health and status
• Diagnose and isolate problems
• Take corrective action immediately!
[Diagram: user → Internet → traffic management → content servers → app servers → data layer, with server-side instrumentation feeding systems management.]
Morino, Resonate
69
Redundant Site Implementation: Growth and Failover
• Multi-site service level control
• Higher service levels
• Better resource utilization
• Not a networking solution
• Not a performance issue
– POP persistence dominates issues
[Diagram: user → Internet → WAN traffic management → sites in SF, NY, and SF.]
Morino, Resonate
70
Management and Administration is Crucial
Consolidated view of multiple sites:
• Eases management of complex e-businesses
• Reduces costs associated with undetected problems
[Diagram: a sysadmin in Denver, CO uses an Enterprise Services Console spanning sites in SF, NY, and SF.]
Morino, Resonate
71
Closed-loop Real-Time Control of IP-based Applications
[Diagram: Intelligent Service Management applies policy-based control to IP-application traffic management functions and systems management functions, with feedback from each closing the loop.]
Morino, Resonate
72
Resonate Case Study
• Central Dispatch
– Software-based load balancer for servers on a LAN
– Sophisticated policy-driven filtering, redirection, load balancing
– Class-of-service support for server access
• Global Dispatch
– Multi-site management, wide-area redirection, disaster recovery
– Advanced traffic mapping capabilities:
» Sticky/persistent session support and sticky session failover
» Directed Traffic Table directs users to predefined POP
– Configurable scheduling based on WAN latency and site load
– POP failover handling
– Advanced stats: avg. DNS response, POP hit rate, other QoS
– Coexistence with existing DNS and load balancing architecture
– Pass multiple IP addresses to client for browser-based failover
– Weighted round-robin scheduling
http://www.resonate.com
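The weighted round-robin scheduling listed above can be sketched by expanding weights into a cyclic schedule (server names and weights are illustrative; smoother variants interleave picks more evenly):

```python
import itertools

def weighted_round_robin(weights):
    """Yield server names in proportion to integer weights: each cycle
    of the schedule contains a server once per unit of weight."""
    schedule = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(schedule)

# hypothetical sites: NY gets 3x the traffic of SF
rr = weighted_round_robin({"ny": 3, "sf": 1})
picks = [next(rr) for _ in range(8)]
```

Over any full cycle the traffic split matches the weights exactly, which is all the wide-area redirection described here needs.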
73
Resonate Case Study
• Commander
– End-to-end monitoring
» URL tests, host access tests, HTTP service availability tests
» SNMP traps
– Test, statistics, and control features
» Gather availability info: site + Web/app/DB servers
» Process events (inaccessible file servers, DB, network congestion, etc.) for reporting/initiation of user-defined actions
– Features:
» Rapid identification and resolution of site problems
» Multi-tier resource monitoring of site servers
» Identify problems before service levels are affected
» Identify network trends essential to optimized site planning
» User-defined service mgmt policies for automated control
http://www.resonate.com
74
Resonate Case Study
• Automated Control for Policy-Based Problem Resolution
– Sophisticated server-level control policies
– Monitors events and processes them according to pre-defined rules & actions
– E.g., sending email/electronic pages, script invocation
• Examples of policy-based control include:
– Schedule traffic away from a Web server w/ slow/failed backend app server
– Increase/decrease traffic to server when perf crosses thresholds
– Enable backup content server in a Central Dispatch site when one or more active content servers fail/become too busy
– Monitor apps and server processes; restart any that fail
http://www.resonate.com