Network-aware OS
ESCC Miami, February 5, 2003
Tom Dunigan [email protected], Matt Mathis [email protected], Brian Tierney [email protected]
UT-BATTELLE / U.S. Department of Energy / Oak Ridge National Laboratory
Roadmap
• Motivation
• Net100 project overview
  – Web100
  – network probes & sensors
  – protocol analysis and tuning
• Year 1 results
  – a TCP tuning daemon
  – tuning experiments
• Year 2
  – ongoing research, Web100 update (Mathis)
www.net100.org
DOE-funded project (Office of Science): $1M/yr, 3 yrs beginning 9/01; LBL, ORNL, PSC, NCAR

Net100 project objectives (network-aware operating systems):
• measure, understand, and improve end-to-end network/application performance
• tune network protocols and applications (grid and bulk transfer)
• first-year emphasis: TCP bulk transfer over high delay/bandwidth nets
Motivation
• Poor network application performance
  – high-bandwidth paths, but applications are slow
  – is it the application? the OS? the network? … yes
  – often need a network “wizard”
• Changing: bandwidths
  – 9.6 Kb/s … 1.5 Mb/s … 45 … 100 … 1000 … ? Gb/s
• Unchanging: TCP
  – speed of light (RTT)
  – MTU (still 1500 bytes)
  – TCP congestion avoidance
• TCP is lossy by design!
  – 2x overshoot at startup, sawtooth
  – recovery after a loss can be very slow on today’s high delay/bandwidth links
  – non-congestive loss bounds throughput to c·MSS/(RTT·√p)
  – recovery rate proportional to MSS/RTT²
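The two bounds above can be sanity-checked with a short calculation. This is an illustrative sketch only: it assumes a 1500-byte MSS, the usual constant c ≈ 1.22 from the Mathis throughput model, and the ORNL-NERSC path parameters quoted elsewhere in these slides.

```python
from math import sqrt

def mathis_bw(mss_bytes, rtt_s, loss_rate, c=1.22):
    """Steady-state TCP throughput bound (bits/sec): c*MSS/(RTT*sqrt(p))."""
    return c * mss_bytes * 8 / (rtt_s * sqrt(loss_rate))

def linear_recovery_time(target_bps, mss_bytes, rtt_s):
    """Seconds for additive increase (1 MSS per RTT) to climb back from
    cwnd/2 to cwnd after a single loss at the target rate."""
    cwnd_pkts = target_bps * rtt_s / (mss_bytes * 8)  # window in packets
    return (cwnd_pkts / 2) * rtt_s                    # RTTs needed * RTT

# 80 ms RTT, 1500-byte segments, 1e-5 loss probability
print(mathis_bw(1500, 0.080, 1e-5) / 1e6)        # ~57.9 Mb/s ceiling
print(linear_recovery_time(500e6, 1500, 0.080))  # ~133 s to recover at 500 Mb/s
```

The second number is why the slides call linear recovery "very slow": at 500 Mb/s and 80 ms, one loss costs over two minutes of climbing.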
[Figure: ORNL-to-NERSC ftp over GigE/OC12, 80 ms RTT: instantaneous and average bandwidth (~8 Mb/s); early startup losses; linear recovery at 0.5 Mb/s]
TCP tuning
• “enable” high speed
  – need buffer = bandwidth * RTT; autotune (ORNL/NERSC: 80 ms, OC12 needs 6 MB)
  – faster slow-start
• avoid losses
  – modified slow-start
  – reduce bursts
  – anticipate loss (ECN, Vegas?)
  – reorder threshold
• speed recovery
  – bigger MTU or “virtual MSS”
  – modified AIMD (0.5, 1)
  – delayed ACKs, initial window, slow-start increment
• avoid congestion collapse
• be fair (?) … intranets, QoS
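The buffer = bandwidth * RTT rule is easy to check against the 6 MB figure quoted above. A minimal sketch, assuming OC12 payload rate of roughly 622 Mb/s; note that on Linux, setsockopt silently caps the request at net.core.wmem_max/rmem_max unless those sysctls are raised:

```python
import socket

def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: the socket buffer needed to keep the pipe full."""
    return int(bandwidth_bps * rtt_s / 8)

buf = bdp_bytes(622e6, 0.080)  # OC12 at 80 ms RTT -> 6,220,000 bytes (~6 MB)
print(buf)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Buffers must be set before connect() for the window scale option to be
# negotiated large enough, which is why a tuner pre-sets windowscale.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf)
s.close()
```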
ns simulation: 500 Mb/s link, 80 ms RTT. Packet loss early in slow start; standard TCP with delayed ACKs takes 10 minutes to recover!
Net100 components for tuning
• TCP protocol analysis
  – simulation/emulation
  – kernel tuning extensions
• Web100 Linux kernel (NSF) www.web100.org
  – instrumented TCP stack (IETF MIB draft)
  – 100+ variables per flow (/proc/web100)
  – socket open/close event notification
  – API and tools for tracing and tuning, e.g., bandwidth tester: http://firebird.ccs.ornl.gov:7123
• Path characterization
  – Network Tuning and Analysis Framework (NTAF)
  – both active and passive measurement (iperf, pipechar)
  – Web100 data augments probe data
  – schedule probes and distribute/archive results
  – database of measurements
  – NTAF/Net100 hosts at PSC, NCAR, LBL, ORNL, NERSC, CERN, UT, SLAC
• TCP tuning daemon
TCP Tuning Daemon
• Work-around Daemon (WAD)
  – tunes an unknowing sender/receiver at startup and/or during a flow
  – Web100 kernel extensions
    • pre-set windowscale to allow dynamic tuning
    • uses netlink to alert the daemon of socket open/close (or poll)
    • besides existing Web100 buffer tuning, new tuning options via WAD_* variables
    • knobs to disable Linux 2.4 caching, burst management, and sendstall
  – config file with static tuning data
    • mode specifies dynamic tuning (Floyd AIMD, NTAF buffer size, concurrent streams)
  – daemon periodically polls NTAF for fresh tuning data
  – written in C (also a Python version)

WAD config file:
  [bob]
  src_addr: 0.0.0.0
  src_port: 0
  dst_addr: 10.5.128.74
  dst_port: 0
  mode: 1
  sndbuf: 2000000
  rcvbuf: 100000
  wadai: 6
  wadmd: 0.3
  maxssth: 100
  divide: 1
  reorder: 9
  sendstall: 0
  delack: 0
  floyd: 1
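The `[section]` plus `key: value` layout shown on the slide happens to be the syntax Python's configparser accepts, so a daemon could load per-flow entries like this. This is an illustrative sketch, not the actual WAD source; the field names are taken verbatim from the slide.

```python
import configparser

# Re-creation of the per-flow config shown on the slide.
WAD_CONF = """
[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
sendstall: 0
delack: 0
floyd: 1
"""

cfg = configparser.ConfigParser()
cfg.read_string(WAD_CONF)
flow = cfg["bob"]
# Typed accessors pull out the AIMD knobs: AI in segments, MD as a fraction.
print(flow["dst_addr"], flow.getint("sndbuf"), flow.getint("wadai"),
      flow.getfloat("wadmd"))
```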
Experimental results (year 1)
• Evaluating the tuning daemon in the wild
  – emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet)
  – tests over 10GigE, OC48, OC12, OC3, ATM/VBR, GigE, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup
  – tests over NistNET 100T testbed
• Various TCP tuning options
  – buffer tuning
  – AIMD mods (including Floyd, both in-kernel and in the WAD)
  – slow-start mods
  – parallel streams vs. a single tuned stream
• Results are anecdotal
  – more systematic testing is ongoing
  – your mileage may vary …
“Network professionals on a closed course. Do not attempt this at home.”
WAD tuning results

Classic buffer tuning
• ORNL to PSC, OC12, 80 ms RTT
• network-challenged app gets 10 Mb/s
• same app with WAD/NTAF-tuned buffer gets 143 Mb/s

Virtual MSS
• tune TCP’s additive increase (WAD_AI)
• add k segments per RTT during recovery
• k = 6 acts like a GigE jumbo frame, but:
  – interrupt rate is not reduced
  – doesn’t do k segments for the initial window
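A back-of-the-envelope model makes the "virtual MSS" payoff concrete: with WAD_AI = k, the window climbs back from a loss k times faster than standard TCP's one segment per RTT. The window size below is illustrative (about 450 Mb/s at 80 ms with 1500-byte segments).

```python
def rtts_to_recover(cwnd_pkts, ai=1):
    """RTTs for additive increase of `ai` segments per RTT to climb
    from cwnd/2 back up to cwnd after a halving."""
    deficit = cwnd_pkts - cwnd_pkts // 2
    return (deficit + ai - 1) // ai  # ceiling division

print(rtts_to_recover(3000, ai=1))  # 1500 RTTs (2 minutes at 80 ms)
print(rtts_to_recover(3000, ai=6))  # 250 RTTs (20 s) with WAD_AI = 6
```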
Tuning around Linux (2.4) TCP
• tunable ssthresh caching
• tunable “sendstall” (TXQUEUELEN)

[Figure: Amsterdam to Chicago, GigE via 10GigE, 100 ms RTT, ~600 Mb/s; annotations mark sendstalls and a UDP event; curves compare Floyd AIMD with standard AIMD]
Floyd AIMD: as cwnd grows, increase AI and decrease MD; do the reverse as cwnd shrinks. Added to the Net100 kernel and to the WAD (WAD tunable).
WAD tuning: modified slow-start and AI
• ORNL to NERSC, OC12, 80 ms RTT
• often losses in slow-start
• WAD-tuned Floyd slow-start and fixed AI (6)

WAD-tuned AIMD and slow-start
• ORNL to CERN, OC12, 150 ms RTT
• parallel streams: AIMD (1/(2k), k)
• WAD-tuned single stream: (0.125, 4)
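The (1/(2k), k) rule maps k parallel standard streams onto one tuned stream: a single loss backs the aggregate off by only 1/(2k), as if it had hit just one of k flows, while additive increase of k segments per RTT matches k flows each growing by one. A quick check of the arithmetic behind the (0.125, 4) tuning quoted above:

```python
def equivalent_single_stream(k):
    """AIMD (MD, AI) for one flow that mimics k parallel standard TCP flows."""
    md = 1 / (2 * k)  # one loss sheds 1/(2k) of the window, not 1/2
    ai = k            # grow k segments per RTT, like k flows growing 1 each
    return md, ai

print(equivalent_single_stream(4))  # (0.125, 4): the ORNL-CERN tuning
```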
GridFTP tuning

Can a tuned single stream compete with parallel streams? Mostly not with “equivalence” tuning, but sometimes … parallel streams have a slow-start advantage.

The WAD can divide a buffer among concurrent flows: fairer? faster? Tests are inconclusive so far … testing on the real Internet is problematic.

Is there a “congestion metric”? Per unit of time?

  Flow       Mb/s   congestion   re-xmits
  untuned      28            4         30
  tuned        74            5        295
  parallel     52           30        401

  untuned      25            7         25
  tuned        67            2        420
  parallel     88           17        440

Data/plots from the Web100 tracer. Buffers: 64K I/O, 4 MB TCP (untuned with 64K TCP buffer: 8 Mb/s, 200 s).
Ongoing/planned Net100 research (year 2)
• analyze effectiveness/fairness of current tuning options
  – simulation
  – emulation
  – on the net (systematic tests)
• NTAF probes: characterizing a path to tune a flow
  – router data (passive)
  – monitoring applications with Web100
  – latest probe tools
• additional tuning algorithms
  – Vegas
  – slow-start increment, reorder resilience, delayed ACKs
  – non-TCP (SABUL, FOBS, TSUNAMI, ?)
  – identify non-congestive loss; ECN?
• parallel/multipath selection/tuning
• WAD-to-WAD tuning
• jumbo frame experiments … the quest for bigger and bigger MTUs
• more user-friendly, usable accelerants
• port to Cray X1 network front-end
• port to other OSes
Future TCP tuning

Reorder threshold
• seeing more out-of-order packets
• WAD tunes a bigger reorder threshold for the path: 40x improvement!
• Linux 2.4 already does a good job:
  – adjusts and caches the reorder threshold
  – can “undo” congestion avoidance

Delayed ACKs
• WAD could turn off delayed ACKs: 2x improvement in recovery rate and slow-start
• Linux 2.4 already turns off delayed ACKs for initial slow-start
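A crude model shows where the roughly 2x factor comes from. With an ACK for every segment, slow start doubles cwnd each RTT; with delayed ACKs (one ACK per two segments), cwnd grows only about 1.5x per RTT. The factors and window size are the standard textbook approximation, used here for illustration.

```python
def slow_start_rtts(target_cwnd, delack=True):
    """RTTs for slow start to grow cwnd from 1 segment to target_cwnd.
    Growth is ~1.5x/RTT with delayed ACKs, ~2x/RTT without."""
    growth = 1.5 if delack else 2.0
    cwnd, rtts = 1.0, 0
    while cwnd < target_cwnd:
        cwnd *= growth
        rtts += 1
    return rtts

print(slow_start_rtts(3000, delack=True))   # 20 RTTs
print(slow_start_rtts(3000, delack=False))  # 12 RTTs
```

The same halving applies during linear recovery, where cwnd grows per ACK received, so disabling delayed ACKs roughly doubles the recovery rate as well.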
ns simulation: 500 Mb/s link, 80 ms RTT. Packet loss early in slow-start; standard TCP with delayed ACKs takes 10 minutes to recover! (Note the aggressive static AIMD: Floyd pre-tune.)

LBL to ORNL (using our TCP-over-UDP): the dup3 case had 289 retransmits, but all were unneeded!
Summary
• Novel approaches
  – non-invasive dynamic tuning of legacy applications
  – using TCP to tune TCP (Web100)
  – tuning on a per-flow/path basis
• Effective evaluation framework
  – protocol analysis and tuning + net/app/OS debugging
  – out-of-kernel tuning
• Beneficial interactions
  – TCP protocols (Floyd, Wu Feng (DRS), Web100, parallel/non-TCP)
  – path characterization research (SciDAC, CAIDA, PingER, pathrate, SCNM)
  – scientific applications and data grids (SciDAC, CERN)
• Performance improvements
www.net100.org