1
High Performance Active End-to-end Network Monitoring
Les Cottrell, Connie Logg, Warren Matthews, Jiri Navratil, Ajay Tirumala – SLAC
Prepared for the 1st SCAMPI Workshop, Amsterdam, January 2003
http://www.slac.stanford.edu/grp/scs/net/talk/caida-jun02.html
Partially funded by DOE/MICS Field Work Proposal on
Internet End-to-end Performance Monitoring (IEPM), by the SciDAC base program, and also supported by IUPAP
2
Outline
• High performance testbed
– Challenges for measurements at high speeds
• Infrastructure for regular high-performance measurements
3
Testbed
[Diagram: testbed topology. 12 cpu servers and 4 disk servers behind a GSR; two 7606s with 6 cpu servers each, one also with 4 disk servers; a T640 at Sunnyvale; links at OC192/POS (10Gbits/s) and 2.5Gbits/s]
4
Problems: Achievable TCP throughput
• Typically use iperf
– Want to measure stable throughput (i.e. after slow start)
– Slow start takes quite long at high BW*RTT
– For GE from California to Geneva (RTT=182ms), slow start takes ~ 5s
– So for slow start to contribute < 10% to the measured throughput, need to run for 50s (see the sketch below)
– About double that for Vegas/FAST TCP
• So developing Quick Iperf
– Use web100 to tell when out of slow start
– Measure for 1 second afterwards
– 90% reduction in duration and bandwidth used
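A back-of-the-envelope sketch (not from the talk) of where the ~5s and 50s figures come from; the 1460-byte MSS and the 1.5x-per-RTT slow-start growth (delayed ACKs) are assumptions:

```python
import math

def slow_start_time(bw_bps, rtt_s, mss_bytes=1460, growth=1.5):
    """Approximate time for slow start to reach the bandwidth-delay product.

    growth=1.5 models cwnd growth per RTT with delayed ACKs;
    use growth=2 if every segment is ACKed.
    """
    bdp_segments = bw_bps * rtt_s / 8 / mss_bytes
    rtts = math.log(bdp_segments) / math.log(growth)
    return rtts * rtt_s

ss = slow_start_time(1e9, 0.182)          # GE, California to Geneva
print("slow start ~ %.1f s" % ss)          # ~4-5 s
print("test length ~ %.0f s" % (10 * ss))  # ~10x for <10% slow-start bias
```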
5
Examples
[Plots: example measurements for paths with 24ms RTT and 140ms RTT]
6
Problems: Achievable bandwidth
• Typically use packet pair dispersion or packet size techniques (e.g. pchar, pipechar, pathload, pathchirp, …); the principle is sketched below
– In our experience current implementations fail for > 155Mbits/s and/or take a long time to make a measurement
• Developed a simple practical packet pair tool ABwE
– Typically uses 40 packets, tested up to 950Mbits/s
– Low impact
– Few seconds for a measurement (can use for real-time monitoring)
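For reference, the packet-pair principle these tools build on: two back-to-back probes are spread out by the bottleneck link, and the receive-side gap gives the capacity. An illustrative sketch (not ABwE's actual algorithm); the probe size and gap values are made up:

```python
def capacity_bps(pkt_bytes, gap_s):
    """Packet-pair estimate: bottleneck capacity ~ packet size / dispersion."""
    return pkt_bytes * 8 / gap_s

# Hypothetical receive-side gaps (s) for 1500-byte probe pairs; real tools
# send tens of pairs and filter statistically to reject queueing noise.
gaps = [12.1e-6, 12.0e-6, 35.4e-6, 12.2e-6]
ests = sorted(capacity_bps(1500, g) for g in gaps)
print("bottleneck ~ %.0f Mbits/s" % (ests[len(ests) // 2] / 1e6))  # median-ish
```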
7
ABwE Results
[Screen shot of the real-time ABwE display tool used for monitoring at SC2002]
• Measurements at 1 minute separation
• Note the sudden dip in available bandwidth every hour
8
Problem: File copy applications
• Some tools will not allow a large enough window (e.g. bbcp is currently limited to 2MBytes)
• Same slow start problem as iperf
• Need a big file to ensure it is not cached
– E.g. 2GBytes at 200 Mbits/s takes 80s to transfer, even longer at lower speeds
– Looking at whether we can get the same effect as a big file with a small (64MByte) file, by playing with commit
• Many more factors involved, e.g. adds file system, disk speeds, RAID etc.
• Maybe the best bet is to let the user measure it for us.
9
Passive (Netflow) Measurements
• Use Netflow measurements from the border router
– Netflow records time, duration, bytes, packets etc. per flow
– Calculate throughput from bytes/duration
– Validate vs. iperf, bbcp etc.
– No extra load on the network; also covers other SLAC & remote hosts & applications, ~ 10-20K flows/day, 100-300 unique pairs/day
– Tricky to aggregate all flows for a single application call (see the sketch below)
• Look for flows with a fixed triplet (src & dst addr, and port)
• Starting at the same time +- 2.5 secs, ending at roughly the same time - needs tuning, missing some delayed flows
• Check it works for known active flows
• To ID the application need a fixed server port (bbcp is peer-to-peer, but we have modified it to support this)
• Investigating differences with tcpdump
– Aggregate throughputs, note number of flows/streams
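A minimal sketch of the aggregation step described above, assuming flow records with (src, dst, port, start, dur, bytes) fields; the record layout is illustrative:

```python
from collections import defaultdict

START_TOL = 2.5  # s: flows of one application call start within +- 2.5 s

def aggregate(flows):
    """Group flow records by (src, dst, port) triplet and similar start time,
    then compute the aggregate throughput per group.

    Bucketing by round(start/TOL) is approximate: calls straddling a bucket
    boundary can split, which mirrors the 'needs tuning' caveat above.
    """
    groups = defaultdict(list)
    for f in flows:  # f: dict with src, dst, port, start, dur, bytes
        key = (f["src"], f["dst"], f["port"], round(f["start"] / START_TOL))
        groups[key].append(f)
    for (src, dst, port, _), fs in groups.items():
        span = max(f["start"] + f["dur"] for f in fs) - min(f["start"] for f in fs)
        if span > 0:
            mbps = sum(f["bytes"] for f in fs) * 8 / span / 1e6
            print(src, dst, port, "streams=%d" % len(fs), "%.1f Mbits/s" % mbps)
```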
10
Passive vs active
[Plots: active (+) and passive (+) throughput vs. date, SLAC to Caltech (Feb-Mar '02); iperf (0-450 Mbits/s) and bbftp (0-80 Mbits/s)]
• Iperf matches well
• BBftp reports under what it achieves
11
Passive bbftp from SLAC to IN2P3 (unrelated to active measurements)
• TCP fair sharing results in the green flow (60 streams) getting twice the throughput of the magenta flow (30 streams) when both run simultaneously (see the worked example below)
• Adding the flows together we see we can get about 80Mbits/s
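A one-line sanity check of the 2:1 ratio, assuming the bottleneck is shared roughly per stream and using the ~80Mbits/s aggregate quoted above (illustrative arithmetic, not a measurement):

$$ \frac{T_{\mathrm{green}}}{T_{\mathrm{magenta}}} = \frac{60}{30} = 2,\qquad T_{\mathrm{green}} \approx \tfrac{60}{90}\times 80 \approx 53\,\mathrm{Mbits/s},\quad T_{\mathrm{magenta}} \approx \tfrac{30}{90}\times 80 \approx 27\,\mathrm{Mbits/s} $$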
12
Problems: Host configuration
• Need a fast interface and a high-speed Internet connection
• Need a powerful enough host
• Need large enough available TCP windows
• Need enough memory
• Need enough disk space
13
Windows and Streams
• Well accepted that multiple streams and/or big windows are important to achieve optimal throughput
• Can be unfriendly to others
• Optimum windows & streams change with changes in the path, hard to optimize
• For 3Gbits/s and 200ms RTT need a 75MByte window (see below)
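The 75MByte figure is just the bandwidth-delay product:

$$ W = \mathrm{BW}\times\mathrm{RTT} = 3\,\mathrm{Gbits/s}\times 0.2\,\mathrm{s} = 600\,\mathrm{Mbits} = 75\,\mathrm{MBytes} $$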
14
Even with big windows (1MB) still need multiple streams with stock TCP
• ANL, Caltech & RAL reach a knee (between 2 and 24 streams); above this the gain in throughput is slow
• Above the knee performance still improves slowly, maybe due to squeezing out others and taking more than a fair share because of the large number of streams
15
Configurations 1/2
• Do we measure with standard parameters, or do we measure with optimal ones?
• Need to measure all of them to understand the effects of parameters and configurations:
– Windows, streams, txqueuelen, TCP stack, MTU
– Lots of variables
• Examples of 2 TCP stacks
– FAST TCP no longer needs multiple streams; this is a major simplification (reduces the number of variables by 1)
[Plots: Stock TCP vs. FAST TCP, both 1500B MTU, 65ms RTT]
16
Configurations: Jumbo frames
• Become more important at higher speeds:
– Reduce interrupts to the CPU and packets to process (see below)
– Similar effect to using multiple streams (Hacker)
• Jumbo frames can achieve >95% utilization SNV to CHI with 1 or multiple streams, up to a Gbit/s
• Factor of 5 improvement over 1500B MTU throughput for stock TCP
• Alternative to a new stack
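A rough illustration of the packet-rate argument, assuming a 9000-byte jumbo MTU (a common choice, not stated on the slide):

$$ \text{at } 1\,\mathrm{Gbits/s}:\quad \frac{10^{9}}{1500\times 8} \approx 83{,}000\,\mathrm{pkts/s} \quad\text{vs.}\quad \frac{10^{9}}{9000\times 8} \approx 14{,}000\,\mathrm{pkts/s} $$

i.e. roughly 6x fewer packets (and interrupts) for the host to process.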
17
Repetitive long-term measurements
18
IEPM-BW = PingER NG
• Driven by the data replication needs of HENP, PPDG, DataGrid
– No longer ship plane/truck loads of data: the latency is poor
– Now ship all data by network (TB/day today, doubling each year)
• Complements PingER, but for high performance nets
• Build an infrastructure to make E2E network (e.g. iperf, packet pair dispersion) & application (FTP) measurements for high-performance A&R networking
• Started at SC2001
19
Tasks
• Develop/deploy a simple, robust ssh-based E2E app & net measurement and management infrastructure for making regular measurements
– A major step is setting up collaborations, getting trust, accounts/passwords
– Can use dedicated or shared hosts, located at borders or with real applications
– COTS hardware & OS (Linux or Solaris) simplifies application integration
• Integrate a base set of measurement tools (ping, iperf, bbcp …), provide simple (cron) scheduling (see the sketch below)
• Develop data extraction, reduction, analysis, reporting, simple forecasting & archiving
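A minimal sketch of the ssh-plus-cron pattern described above (illustrative, not the IEPM-BW engine); the host names, cron interval, and paths are hypothetical:

```python
#!/usr/bin/env python3
"""Minimal ssh + cron measurement loop.  Example cron entry:
    0 */2 * * *  /opt/iepm/measure.py
"""
import subprocess, time

HOSTS = ["remote1.example.edu", "remote2.example.org"]  # hypothetical
TIMEOUT = 120  # s: time out all commands; some lock up or never end

for host in HOSTS:
    # Start the iperf server remotely for this measurement only, then
    # kill it, so a crashed or hung server cannot poison later runs.
    subprocess.run(["ssh", host, "iperf -s -D"], timeout=TIMEOUT)
    try:
        res = subprocess.run(["iperf", "-c", host, "-t", "10", "-f", "m"],
                             capture_output=True, text=True, timeout=TIMEOUT)
        last = (res.stdout.strip().splitlines() or ["no output"])[-1]
        print(time.strftime("%F %T"), host, last)
    except subprocess.TimeoutExpired:
        print(time.strftime("%F %T"), host, "TIMED OUT")
    finally:
        subprocess.run(["ssh", host, "pkill iperf"], timeout=TIMEOUT)
```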
20
Purposes
• Compare & validate tools
– With one another (pipechar vs pathload vs iperf, or bbcp vs bbftp vs GridFTP vs Tsunami)
– With passive measurements
– With web100
• Evaluate TCP stacks (FAST, Sylvain, HS TCP, Frank Kelly …)
• Trouble shooting
• Set expectations, planning
• Understand
– requirements for high performance, jumbos
– performance issues in the network, OS, cpu, disk/file system etc.
• Provide public access to results for people & applications
21
Measurement Sites
• Production, i.e. choose their own remote hosts, run the monitor themselves:
– SLAC (40) San Francisco, FNAL (2) Chicago, INFN (4) Milan, NIKHEF (32) Amsterdam, APAN Japan (4)
• Evaluating the toolkit:
– Internet2 (Michigan), Manchester University, UCL, Univ. Michigan, GA Tech (5)
• Also demonstrated at:
– iGrid2002, SC2002
• Using on the Caltech / SLAC / DataTag / Teragrid / StarLight / SURFnet testbed
• If all goes well, 30-60 minutes to install a monitoring host; often problems with keys, disk space, blocked ports, not being registered in DNS, need for web access
• SLAC monitoring over 40 sites in 9 countries
22
[Map: IEPM-BW measurement topology with SLAC as monitor. Remote sites include FNAL, ANL, NIKHEF, CERN, IN2P3, Caltech, SDSC, BNL, NERSC, LANL, ORNL, JLAB, TRIUMF, KEK, RAL, UCL, UManc, DL, UTDallas, UMich, I2, SOX, UFL, Rice, Stanford, UIUC, APAN, RIKEN, INFN-Roma, INFN-Milan; networks include ESnet, Abilene, JAnet, Geant, GARR, CAnet, Surfnet, Renater, CalREN, CESnet, NNW; exchange points SNV, CHI, NY, HSTN, SEA, ATL, CLV, IPLS. Links marked 100Mbps or GE, with numeric annotations per path]
23
Results
• Time series data, scatter plots, histograms
• CPU utilization required (MHz/Mbits/s), jumbo and standard frames, new stacks
• Forecasting
• Diurnal behavior characterization
• Disk throughput as a function of OS, file system, caching
• Correlations with passive measurements, web100
24
www.slac.stanford.edu/comp/net/bandwidth-tests/antonia/html/slac_wan_bw_tests.html
25
Excel
26
Problem Detection
• There must be lots of people working on this?
• Our approach is:
– Rolling averages, if we have recent data
– Diurnal changes
27
Rolling Averages
• EWMA ~ average of the last 5 points, +- 2% (see the sketch below)
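A minimal sketch of an EWMA-based rolling average with an alarm band, assuming the smoothing factor is chosen so the EWMA roughly tracks the mean of the last 5 points (alpha ~ 2/(5+1)); the +-2% envelope is from the slide, the sample series is made up:

```python
def ewma_alarm(samples, alpha=2.0 / (5 + 1), band=0.02):
    """Yield samples that fall outside an EWMA-based +-band envelope.

    alpha ~ 2/(N+1) makes the EWMA track roughly the last N=5 points.
    """
    avg = samples[0]
    for x in samples[1:]:
        if abs(x - avg) > band * avg:
            yield x, avg        # candidate alarm: outside the envelope
        avg = alpha * x + (1 - alpha) * avg

throughputs = [92.0, 91.5, 92.3, 91.8, 60.1, 91.9]  # made-up Mbits/s series
for x, avg in ewma_alarm(throughputs):
    print("alarm: measured %.1f vs rolling avg %.1f" % (x, avg))
```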
28
• Indicate "diurnalness" by the ratio α/γ; can look at the previous week at the same time of day if we do not have recent measurements
• Fit to α*sin(t+φ)+γ, with t scaled so the period is 24 hours (see the fitting sketch below)
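A minimal sketch of the fit, assuming a fixed 24-hour period and using scipy's curve_fit; the sample series is made up:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit throughput(t) to alpha*sin(W*t + phi) + gamma with the period
# fixed at 24 hours; alpha/gamma then indicates "diurnalness".
W = 2 * np.pi / 24.0

def model(t, alpha, phi, gamma):
    return alpha * np.sin(W * t + phi) + gamma

t = np.arange(0, 72, 1.5)                  # hours; made-up 3-day series
y = 20 * np.sin(W * t + 1.0) + 90 + np.random.normal(0, 3, t.size)

(alpha, phi, gamma), _ = curve_fit(model, t, y, p0=[10, 0, y.mean()])
print("diurnalness alpha/gamma = %.2f" % abs(alpha / gamma))
```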
29
Alarms
• Too much to keep track of
• Rather not wait for complaints
• Automated alarms
• Rolling average à la RIPE-TTM
30
31
32
Action
• However a concern is generated:
– Look for changes in traceroute
– Compare tools
– Compare common routes
– Cross reference other alarms
33
Next steps
• Rewrite (again) based on experiences
– Improved ability to add new tools to the measurement engine and integrate them into extraction, analysis
• GridFTP, tsunami, UDPMon, pathload …
– Improved robustness, error diagnosis, management
• Need improved scheduling
• Want to look at other security mechanisms
34
More Information
• IEPM/PingER home site:
– www-iepm.slac.stanford.edu/
• IEPM-BW site:
– www-iepm.slac.stanford.edu/bw
• Quick Iperf:
– http://www-iepm.slac.stanford.edu/bw/iperf_res.html
• ABwE:
– Submitted to PAM2003
35
Passive vs Active correlations
[Scatter plot: passive vs. active throughputs, annotated "Strong"]
36
IEPM-BW Uses/deliverables
• Understand and identify the resources needed to achieve high throughput performance for Grid and other data intensive applications
• Provide access to archival and near real-time data and results for eyeballs and applications:
– planning and expectation setting, seeing the effects of upgrades
– assisting trouble-shooting by identifying what is impacted, and the time and magnitude of changes and anomalies
– as input for application steering (e.g. data grid bulk data transfer), changing configuration parameters
– for forecasting and further analysis
• Identify critical changes in performance, record them and notify administrators and/or users
• Provide a platform for evaluating new network tools (e.g. pathrate, pathload, GridFTP, INCITE, UDPmon …)
• Provide a measurement/analysis/reporting suite for Grid & hi-perf sites
IEPM-BW Deployment
38
[Map: IEPM-BW deployment of monitoring sites and remote hosts across ESnet, Abilene, Geant, JAnet, CAnet, Surfnet, CESnet, CalREN: SLAC, FNAL, ANL, NIKHEF, CERN, IN2P3, Caltech, SDSC, BNL, NERSC, LANL, JLAB, TRIUMF, KEK, RAL, UCL, UManc, DL, Rice, UTDallas, NCSA, UMich, I2, SOX, UFL, APAN, RIKEN, INFN-Roma, INFN-Milan, Stanford; EDG/PPDG/GriPhyN sites marked; exchange points SNV, CHI, NY, HSTN, SEA, ATL, CLV, IPLS]
39
Early results
• Reasonable estimates of achievable throughput with 10 sec iperf measurements
• Multiple streams and big windows are critical
– Improve over the defaults by a factor of 5 to 60
– There is an optimum windows*streams
• Continuous data at 90 min intervals from SLAC to 33 hosts in 8 countries since Dec '01
40
Early results
• 1MHz of CPU ~ 1Mbits/s of throughput
• Bbcp mem to mem tracks iperf
• BBFTP & bbcp disk to disk track iperf until disk performance limits
• Bandwidth estimators fail above 100Mbits/s
• High throughput affects RTT for others
– E.g. to Europe adds ~ 100ms
• Archival raw throughput data & graphs already available via http
41
E.g. iperf vs file copy disk to disk
[Scatter plot: file copy disk-to-disk throughput vs. iperf TCP (0-400 Mbits/s), with Fast Ethernet, OC3 and disk-limited regions marked]
• Over 60Mbits/s, iperf >> file copy
42
Disk performance
• It matters for the applications
• Depends on:
– disk sub-system
– file system (nfs, ufs, /tmp)
– caching: data can stay cached for a long time, which can improve throughput by a factor of 10 or more
• Read speed varies from 4MBytes/s to 230MBytes/s measured for 30 remote hosts (depends on caching)
• Uncached write speeds vary from 1MByte/s to 29MBytes/s
• If disk speed < network speed there is no need to measure the network, so disks & servers need parallelizing
43
E.g. iperf vs pipechar
[Scatter plot: pipechar min throughput (Mbits/s) vs. iperf TCP (0-400 Mbits/s)]
• Pipechar (which does not use TCP) disagrees badly above 100Mbits/s (6 hosts, 20%); 50% of hosts have reasonable agreement
• Typical of several bandwidth prediction tools
• Working with the developer
44
Web100 vs Iperf throughputs
• A nice sanity check is to see whether Web100 sees the same throughput as iperf reports
• Web100 throughput = Streams * DataBytesOut * 8 / max(iperfStreamDur) / 10^6
• Looks like good agreement (see the sketch below)
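The cross-check formula above, spelled out; the variable values are made up for illustration:

```python
def web100_mbps(streams, data_bytes_out, max_stream_dur_s):
    """Web100-side throughput in Mbits/s, per the slide's formula."""
    return streams * data_bytes_out * 8 / max_stream_dur_s / 1e6

# Hypothetical numbers: 4 streams, ~125 MBytes per stream, 10 s test
print("%.0f Mbits/s" % web100_mbps(4, 125e6, 10.0))   # -> 400 Mbits/s
```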
45
Web100 estimated vs observed throughput
• Est BW ~ C * Streams * MSS / (RTT * sqrt(loss)) (Mathis et al. & Hacker; worked example below)
• Measure retransmissions using Web100, loss ~ PktsRetrans/PktsOut
• Note the high degree of correlation (R^2 > 0.8)
• If the window is large enough (>= 128KB) then there is ~ a common relation; the window threshold varies from link to link
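For scale, a worked instance of the Mathis et al. formula with C = sqrt(3/2) ~ 1.22; the single stream, 1460-byte MSS, 182ms RTT and 10^-4 loss are illustrative values, not from the talk:

$$ BW \approx \frac{C\cdot\mathrm{Streams}\cdot\mathrm{MSS}}{\mathrm{RTT}\,\sqrt{\mathrm{loss}}} = \frac{1.22\times 1\times 1460\times 8\,\mathrm{bits}}{0.182\,\mathrm{s}\times\sqrt{10^{-4}}} \approx 7.8\,\mathrm{Mbits/s} $$

which is why, with stock TCP at realistic loss rates, long paths need many streams and/or a larger MSS (jumbo frames) to fill a GE link.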
46
Effect of saturation
• If we keep below the knee then the RTT stays low
• Could also use congestion signals
• Could use this to optimize, e.g. throttle back the application
47
Forecasting
• Given access to the data one can do real-time forecasting for
– TCP bandwidth, file transfer/copy throughput
• E.g. NWS; Predicting the Performance of Wide Area Data Transfers by Vazhkudai, Schopf & Foster
• Developing a simple prototype using the average of previous measurements
– Validate predictions versus observations
– Get better estimates to adapt the frequency of active measurements & reduce impact
• Also use ping RTTs and route information
– Look at the need for diurnal corrections
– Use for steering applications
• Working with NWS for more sophisticated forecasting
• Can also use on-demand bandwidth estimators (e.g. pipechar, but need to know the range of applicability)
48
Forecast results
• Predict = moving average of the last 5 measurements +- σ (see the sketch below)
• % average error = average(abs(observe - predict)/observe)
• Average % error (33 nodes):
  Iperf TCP   13% +- 8%
  Bbcp mem    14% +- 12%
  Bbcp disk   15% +- 13%
  bbftp       23% +- 18%
  pipechar    13% +- 11%
[Plot: Iperf TCP throughput SLAC to Wisconsin, Jan '02, Mbits/s vs. date; predicted (x) vs. observed]
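A minimal sketch of the predictor and error metric defined above; the sample series is made up:

```python
import statistics

def predict(history, n=5):
    """Forecast = moving average of the last n measurements, +- stdev."""
    window = history[-n:]
    return statistics.mean(window), statistics.pstdev(window)

def avg_pct_error(observed, predicted):
    """% average error = average(abs(observe - predict)/observe)."""
    return statistics.mean(abs(o - p) / o for o, p in zip(observed, predicted))

series = [52.0, 55.1, 50.3, 54.2, 53.0, 48.7, 51.9, 54.8]  # made-up Mbits/s
preds = [predict(series[:i])[0] for i in range(5, len(series))]
print("avg %% error = %.1f%%" % (100 * avg_pct_error(series[5:], preds)))
```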
49
Impact on Others
• Make ping measurements with & without iperf loading
– Loss loaded (unloaded)
– RTT
• Looking at how to avoid the impact: e.g. QBSS/LBE, application pacing, a control loop on stdev(RTT) reducing streams (see the sketch below); want to avoid scheduling
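A minimal sketch of the stdev(RTT)-driven control-loop idea (illustrative only; the threshold, the probe_rtts() source, and the restream() hook are hypothetical):

```python
import statistics

RTT_JITTER_LIMIT = 0.010   # s: hypothetical stdev(RTT) threshold
MIN_STREAMS = 1

def control_loop(streams, probe_rtts, restream):
    """Back off parallel streams while RTT jitter shows we are hurting others.

    probe_rtts() -> recent ping RTTs (s); restream(n) applies a new count.
    """
    while streams > MIN_STREAMS:
        rtts = probe_rtts()
        if statistics.pstdev(rtts) <= RTT_JITTER_LIMIT:
            break                    # path looks calm; keep the current rate
        streams -= 1                 # throttle back the application
        restream(streams)
    return streams
```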
50
Possible HEP usage of QBSS
• Apply priority to lower-volume interactive voice/video-conferencing and real time control
• Apply QBSS to high volume data replication
• Leave the rest as Best Effort
• Since 40-65% of the bytes to/from SLAC come from a single application, we have modified it to enable setting of the TOS bits
• Need to identify bottlenecks and implement QBSS there
• Bottlenecks tend to be at the edges, so we hope to try this with a few HEP sites
51
Experiences
• Getting ssh accounts and resources on remote hosts
– Tremendous variation in account procedures from site to site; takes up to 7 weeks, requires knowing somebody who cares, and sites are becoming increasingly circumspect
– Steep learning curve on ssh, different versions
• Getting disk space for file copies (100s of MBytes)
• Diversity of OSs, userids, directory structures, where to find perl, iperf ..., contacts
– Required a database to track; it also anonymizes hostnames, tracks code versions, whether to execute a command (e.g. no ping if a site blocks ping) & with what options
– Developed tools to download software and to check remote configurations
• Remote server (e.g. iperf) crashes
– Start & kill the server remotely for each measurement
• Commands lock up or never end
– Time out all commands
– Hung processes need to be killed (or else they fill up sockets with CLOSE-WAIT)
• Some commands (e.g. pipechar) take a long time; others (disk throughput, pathrate) have results that change infrequently - so they have different schedules
• AFS tokens to allow access to the .ssh identity timed out; used trscron
• Protocol port blocking
– Ssh following the Xmas attacks; bbftp, iperf ports; big variation between sites
– Wrote analyses to recognize and track problems and work with site contacts
• An ongoing issue, especially with the increasing need for security, and since we want to measure inside firewalls close to the real applications
52
Next steps
• Develop/extend management, analysis, reporting, navigating tools - improve robustness, manageability, work around ssh anomalies
• Get improved forecasters (NWS), quantify how they work, and provide tools to access them
• Optimize intervals (using forecasts, and lighter weight measurements) and durations
• Evaluate self rate-limiting application (bbcp), look at using Web100 for a feedback loop
• Extend analysis of passive Netflow measurements
• Add gridFTP (with Allcock@ANL), UDPmon (RHJ@Manchester) & new BW measurers - netest (Jin@LBNL), pathrate, pathload (Dovrolis@UDel)
– Make early data available via http to interested & "friendly" researchers
– CAIDA for correlation and validation of Pipechar & iperf etc. (sent documentation)
– NWS for forecasting with UCSB (sent documentation)
• Understand correlations, validate various tools, choose the optimum set
• Make data available by std methods (e.g. MDS, GMA, …) - with Dantong@BNL, Jenny Schopf@ANL & Tierney@LBNL
• Make tools portable, set up other monitoring sites, e.g. PPDG sites
– SLAC ported to Linux
– Currently porting measurement tools to Manchester
– Will work with INFN/Trieste & FNAL to port to other sites