Transport Benchmarking Panel Discussion


Page 1: Transport Benchmarking   Panel Discussion

PFLDnet, Nara, Japan, 2-3 Feb 2006. R. Hughes-Jones, Manchester

Transport Benchmarking

Panel Discussion

Richard Hughes-Jones, The University of Manchester

www.hep.man.ac.uk/~rich/ then “Talks”

Page 2: Transport Benchmarking   Panel Discussion


Packet Loss and new TCP Stacks: TCP Response Function

Throughput vs loss rate: the further a curve lies to the right, the faster the recovery. Packets are dropped in the kernel.

MB-NG rtt 6 ms; DataTAG rtt 120 ms
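
As a reminder of the shape being plotted, the standard-TCP response function (the Mathis et al. approximation) relates achievable throughput to segment size, RTT and loss probability p; a sketch, with the numerical example an illustration only:

% Standard (Reno) TCP response function, Mathis et al. approximation
\[
  \text{Throughput} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}}\,\sqrt{\frac{3}{2p}}
\]
% Illustrative numbers: MSS = 1460 bytes, RTT = 120 ms (DataTAG), p = 10^{-6}
% give (1460 x 8 bit / 0.12 s) x sqrt(1.5 x 10^6) ~ 0.097 Mbit/s x 1225 ~ 120 Mbit/s.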

Page 3: Transport Benchmarking   Panel Discussion


Packet Loss and new TCP Stacks: TCP Response Function

UKLight London-Chicago-London, rtt 177 ms, 2.6.6 kernel

Agreement with theory is good

Some new stacks are good at high loss rates

[Plots: sculcc1-chi-2, iperf, 13 Jan 05. TCP achievable throughput (Mbit/s, log and linear scales) vs packet drop rate 1 in n (n from 100 to 10^8). Series: A0 1500 (standard), A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood, A7 Vegas, plus A0 Theory and Scalable Theory curves.]
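
To reproduce the theory curve in these plots one can tabulate the standard-TCP response function over the "packet drop rate 1 in n" axis; a minimal sketch in Python, assuming 1460-byte segments and the two RTTs quoted above (the constants are not the slide's exact fit):

# Sketch: standard-TCP theory throughput vs "packet drop rate 1 in n"
# (Mathis approximation; segment size and RTT values are assumptions).
import math

MSS_BITS = 1460 * 8
RTTS = {"MB-NG 6 ms": 0.006, "UKLight 177 ms": 0.177}

def reno_theory_mbps(rtt_s, p):
    """Mathis et al. approximation MSS/RTT * sqrt(3/(2p)), in Mbit/s."""
    return MSS_BITS / rtt_s * math.sqrt(3.0 / (2.0 * p)) / 1e6

for label, rtt in RTTS.items():
    for exp in range(2, 9):                      # drop 1 in 10^2 ... 1 in 10^8
        n = 10 ** exp
        print(f"{label}: drop 1 in {n:>9d} -> {reno_theory_mbps(rtt, 1.0 / n):10.1f} Mbit/s")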

Page 4: Transport Benchmarking   Panel Discussion


TCP Throughput – DataTAG. Different TCP stacks tested on the DataTAG network, rtt 128 ms, drop 1 in 10^6

High-Speed: rapid recovery

Scalable: very fast recovery

Standard: recovery would take ~20 minutes
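
A rough check on the "~20 minutes" figure, assuming a 1 Gbit/s path, 1460-byte segments and the 128 ms RTT quoted:

% Pipe-filling window at 1 Gbit/s, RTT = 128 ms, 1460-byte segments:
\[
  W \;=\; \frac{10^{9}\ \text{bit/s} \times 0.128\ \text{s}}{1460 \times 8\ \text{bit}} \;\approx\; 1.1 \times 10^{4}\ \text{packets}
\]
% Growing cwnd by one segment per RTT, recovery takes between
% (W/2) x RTT ~ 700 s after a simple halving and W x RTT ~ 1400 s if the
% window collapses further, i.e. of the order of the quoted ~20 minutes.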

Page 5: Transport Benchmarking   Panel Discussion


MTU and Fairness

Two TCP streams share a 1 Gb/s bottleneck, RTT = 117 ms. MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s. MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s. Link utilization: 70.7%.
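
Two quick checks on these numbers: the standard response function scales with MSS (all else equal), so the 9000-byte stream would be expected to get roughly three times the 3000-byte stream's share, while the measured ratio is about 1.9; and the utilisation figure follows directly from the two averages:

% Measured throughput ratio vs the MSS-proportional expectation:
\[
  \frac{464}{243} \;\approx\; 1.9 \qquad\text{vs}\qquad \frac{9000}{3000} \;=\; 3
\]
% Utilisation of the 1 Gb/s bottleneck:
\[
  \frac{243 + 464}{1000} \;\approx\; 70.7\%
\]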

[Network diagram: Host #1 and Host #2 at CERN (GVA) connect via 1 GE to a GbE switch and router, across a POS 2.5 Gbps link to a router and GbE switch at Starlight (Chi), then via 1 GE to Host #1 and Host #2; this shared link is the bottleneck.]

[Plot: throughput of the two streams with different MTU sizes sharing the 1 Gbps bottleneck; throughput (Mbps, 0-1000) vs time (0-6000 s), with traces for MTU = 3000 bytes and MTU = 9000 bytes and each stream's average over the life of the connection.]

Sylvain Ravot DataTag 2003

Page 6: Transport Benchmarking   Panel Discussion


RTT and Fairness

[Network diagram: host pairs at CERN (GVA) connect via 1 GE and a GbE switch/router over POS 2.5 Gb/s to Starlight (Chi), and onwards over POS 10 Gb/s and 10GE to Sunnyvale; the two flows share a 1 Gb/s bottleneck.]

Two TCP streams share a 1 Gb/s bottleneck. CERN <-> Sunnyvale, RTT = 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s. CERN <-> Starlight, RTT = 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s. MTU = 9000 bytes. Link utilization = 71.6%.
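
For a given loss rate, standard TCP throughput scales as 1/RTT, so the RTT ratio alone would predict roughly a 1.5:1 split; the measured split is more unequal, and the utilisation again follows from the two averages:

% RTT-only expectation vs measurement (assuming both flows see similar loss):
\[
  \frac{\mathrm{RTT}_1}{\mathrm{RTT}_2} \;=\; \frac{181}{117} \;\approx\; 1.5
  \qquad\text{vs}\qquad
  \frac{514}{202} \;\approx\; 2.5
\]
% Utilisation of the 1 Gb/s bottleneck:
\[
  \frac{202 + 514}{1000} \;\approx\; 71.6\%
\]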

[Plot: throughput of the two streams with different RTT sharing the 1 Gbps bottleneck; throughput (Mbps, 0-1000) vs time (0-7000 s), with traces for RTT = 181 ms and RTT = 117 ms and each stream's average over the life of the connection.]

Sylvain Ravot DataTag 2003

Page 7: Transport Benchmarking   Panel Discussion


TCP Sharing & Recovery: Methodology (1 Gbit/s)

Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)

Used iperf/TCP and UDT/UDP to generate traffic

Each run was 16 minutes, in 7 regions

[Diagram: iperf or UDT traffic plus 1/s ICMP ping sent from SLAC to Caltech/UFL/CERN through the TCP/UDP bottleneck; each run is divided into regions of 2 min and 4 min.]

Les Cottrell PFLDnet 2005
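
A minimal sketch of this kind of run in Python, assuming iperf (TCP) for the foreground traffic and a 1/s ping as the delay probe; the host name, region lengths and log handling are illustrative, not the exact SLAC setup:

# Sketch: 1/s ping running alongside alternating iperf traffic regions.
import subprocess, time

TARGET = "remote.example.org"                  # hypothetical far-end host running "iperf -s"
REGIONS = [120, 240, 120, 240, 120, 240, 120]  # seven 2 min / 4 min regions (illustrative)

ping = subprocess.Popen(["ping", "-i", "1", TARGET], stdout=open("ping.log", "w"))
try:
    for seconds in REGIONS:
        # iperf2 client; -t sets the transfer duration in seconds
        subprocess.run(["iperf", "-c", TARGET, "-t", str(seconds)],
                       stdout=open("iperf.log", "a"), check=False)
        time.sleep(5)                          # short gap between regions
finally:
    ping.terminate()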

Page 8: Transport Benchmarking   Panel Discussion


Low performance on fast long-distance paths. AIMD: add a = 1 packet to cwnd per RTT; decrease cwnd by a factor b = 0.5 on congestion. Net effect: recovers slowly and does not use the available bandwidth effectively, so poor throughput and unequal sharing.

TCP Reno single stream

Congestion has a dramatic effect

Recovery is slow

Increase recovery rate

SLAC to CERN

RTT increases when the flow achieves its best throughput

Les Cottrell PFLDnet 2005

Remaining flows do not take up slack when flow removed
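
The slow recovery can be seen with a toy cwnd model using the a = 1, b = 0.5 constants above; for contrast, a Scalable-style multiplicative increase is included, with its constants (1% per RTT, reduce by 1/8) treated as assumptions for illustration rather than a validated model:

# Toy model: time to refill the pipe after a single loss on a long fat path.
RTT = 0.180                          # seconds, SLAC-CERN scale (assumed)
MSS_BITS = 1460 * 8
LINK = 1e9                           # 1 Gbit/s (assumed)
W = LINK * RTT / MSS_BITS            # pipe-filling window, in packets

def recovery_time(increase, decrease_factor):
    cwnd, t = W * (1.0 - decrease_factor), 0.0
    while cwnd < W:
        cwnd = increase(cwnd)
        t += RTT
    return t

reno = recovery_time(lambda w: w + 1.0, 0.5)      # AIMD a = 1, b = 0.5
scal = recovery_time(lambda w: w * 1.01, 0.125)   # Scalable-like (assumed constants)
print(f"window ~ {W:.0f} packets")
print(f"Reno recovery ~ {reno:.0f} s, Scalable-like recovery ~ {scal:.1f} s")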

Page 9: Transport Benchmarking   Panel Discussion


Hamilton TCP: one of the best performers

Throughput is high. Big effects on RTT when it achieves best throughput. Flows share equally.

Appears to need >1 flow to achieve best throughput

Two flows share equally

SLAC-CERN

> 2 flows appears less stable

Page 10: Transport Benchmarking   Panel Discussion


Implementation problems: SACKs. Look into what's happening at the algorithmic level, e.g. with web100:

Strange hiccups in cwnd; the only correlation is with SACK arrivals. SACK processing is inefficient for large bandwidth-delay products.

The sender write queue (a linked list) is walked for each SACK block, to mark lost packets and to re-transmit.

Processing takes so long that the input queue becomes full and timeouts occur.

Scalable TCP on MB-NG with 200 Mbit/s CBR background

Yee-Ting Li
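
A rough illustration of the data-structure problem in Python: if the sender's write queue is a plain linked list, every SACK block in every ACK forces a walk over all in-flight segments, so the per-ACK cost grows with the bandwidth-delay product. This is a sketch of the issue, not the Linux implementation:

# Sketch: per-SACK-block linear walks over a linked-list write queue.
class Seg:
    __slots__ = ("seq", "sacked", "lost", "next")
    def __init__(self, seq):
        self.seq, self.sacked, self.lost, self.next = seq, False, False, None

def build_queue(n):
    head = cur = Seg(0)
    for seq in range(1, n):
        cur.next = Seg(seq)
        cur = cur.next
    return head

def process_sack(head, blocks):
    """Walk the whole list once per SACK block, marking sacked and lost segments."""
    steps = 0
    for lo, hi in blocks:
        seg = head
        while seg is not None:
            steps += 1
            if lo <= seg.seq < hi:
                seg.sacked = True
            elif seg.seq < lo and not seg.sacked:
                seg.lost = True              # candidate for re-transmission
            seg = seg.next
    return steps

queue = build_queue(10_000)                  # roughly a BDP's worth of in-flight segments
print("list steps for one ACK carrying 3 SACK blocks:",
      process_sack(queue, [(100, 200), (400, 450), (900, 950)]))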

Page 11: Transport Benchmarking   Panel Discussion


TCP Stacks & CPU Load. A real user problem: an end-host TCP flow at 960 Mbit/s with rtt 1 ms falls to 770 Mbit/s when the rtt is 15 ms.

[Plots: mk5-606-g7_10Dec05. Sender CPU mode (% kernel, user, nice, idle) and TCP throughput (Mbit/s, 0-1000) vs nice value of the CPU-load process (0-20, large value = low priority), with the no-CPU-load case shown for reference.]

1.2 GHz PIII, rtt 1 ms, TCP iperf 980 Mbit/s

Kernel mode 95%, idle 1.3%; CPU load run with nice priority

Throughput falls as its priority increases

No loss, no timeouts

Not enough CPU power

[Plots: mk5-606-g7_17Jan05. Same layout as above: sender CPU mode (% kernel, user, nice, idle) and TCP throughput (Mbit/s, 0-1000) vs nice value of the CPU-load process, with the no-CPU-load case shown for reference.]

2.8 GHz Xeon, rtt 1 ms, TCP iperf 916 Mbit/s

Kernel mode 43%, idle 55%; CPU load run with nice priority

Throughput constant as priority increases

No loss, no timeouts

Kernel mode includes the TCP stack and the Ethernet driver
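
A sketch in Python of the kind of sweep behind these plots: a CPU-burning process is started at increasing nice values while iperf measures TCP throughput to the far host; the host name, durations and output parsing are placeholders:

# Sketch: TCP throughput vs nice value of a competing CPU-load process (iperf2 assumed).
import re, subprocess

TARGET = "remote.example.org"                # hypothetical receiver running "iperf -s"

def cpu_burner(nice_value):
    # Busy loop started at the given (lowered) priority.
    return subprocess.Popen(["nice", "-n", str(nice_value),
                             "python3", "-c", "while True: pass"])

for n in range(0, 20, 2):                    # nice 0 (normal) .. 18 (low priority)
    burner = cpu_burner(n)
    try:
        out = subprocess.run(["iperf", "-c", TARGET, "-t", "20", "-f", "m"],
                             capture_output=True, text=True).stdout
    finally:
        burner.kill()
    m = re.search(r"([\d.]+)\s+Mbits/sec", out)
    print(f"nice={n:2d}  throughput={m.group(1) if m else '?'} Mbit/s")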

Page 12: Transport Benchmarking   Panel Discussion


Check out the end host: bbftp. What is the end host doing with your protocol? Look at the PCI-X buses. 3Ware 9000 controller, RAID0; 1 Gbit Ethernet link; 2.4 GHz dual Xeon; ~660 Mbit/s.

PCI-X bus with RAID Controller

PCI-X bus with Ethernet NIC

Read from disk for 44 ms every 100 ms

Write to network for 72 ms
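
A back-of-envelope reading of those bus traces, assuming the NIC transmits at close to 1 Gbit/s line rate during its active periods:

% Network writes are active for ~72 ms of every 100 ms cycle:
\[
  0.72 \times 1\ \text{Gbit/s} \;\approx\; 720\ \text{Mbit/s},
\]
% of the same order as the ~660 Mbit/s achieved, suggesting the alternating
% disk-read / network-write duty cycle, rather than the link, sets the bbftp rate here.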

Page 13: Transport Benchmarking   Panel Discussion


Transports for LightPaths. For a lightpath with a BER of 10^-16, i.e. a packet loss rate of 10^-12, or 1 loss in about 160 days, what do we use?

Host-to-host lightpath: one application, no congestion, lightweight framing

Lab-to-lab lightpath: many applications share it, classic congestion points, TCP stream sharing and recovery, advanced TCP stacks
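
A quick check of the loss-rate figure quoted at the top of this slide, assuming 1500-byte packets and a fully loaded 1 Gbit/s lightpath:

% Packet loss rate from the bit error rate (1500 bytes = 12000 bits per packet):
\[
  p \;\approx\; 12000 \times 10^{-16} \;=\; 1.2 \times 10^{-12}
\]
% Time between losses at 1 Gbit/s (about 8.3 x 10^4 packets/s):
\[
  \frac{1}{1.2\times10^{-12} \times 8.3\times10^{4}\ \text{s}^{-1}} \;\approx\; 10^{7}\ \text{s} \;\approx\; 120\ \text{days},
\]
% of the same order as the "1 loss in about 160 days" quoted.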

Page 14: Transport Benchmarking   Panel Discussion


Transports for LightPaths

Some applications suffer when using TCP and may prefer to use UDP, DCCP, XCP, …

E.g. with e-VLBI the data wave-front gets distorted and correlation fails

Consider & include other transport layer protocols when defining tests.

User-controlled lightpaths: Grid scheduling of CPUs & network, many application flows, no congestion, lightweight framing

Page 15: Transport Benchmarking   Panel Discussion


A Few Items for Discussion
Achievable throughput
Sharing link capacity (OK, what is sharing?)
Convergence time
Responsiveness
rtt fairness (OK, what is fairness?)
mtu fairness
TCP friendliness
Link utilisation (by this flow or all flows)
Stability of achievable throughput
Burst behaviour
Packet loss behaviour
Packet re-ordering behaviour
Topology – maybe some “simple” setups
Background or cross traffic – how realistic is needed? What protocol mix?
Reverse traffic
Impact on the end host – CPU load, bus utilisation, offload
Methodology – simulation, emulation and real links ALL help

Page 16: Transport Benchmarking   Panel Discussion


Any Questions?

Page 17: Transport Benchmarking   Panel Discussion


Backup Slides

Page 18: Transport Benchmarking   Panel Discussion


10 Gigabit Ethernet: Tuning PCI-X. 16080-byte packets every 200 µs, Intel PRO/10GbE LR adapter. PCI-X bus occupancy vs mmrbc.

Measured times and times based on PCI-X timings from the logic analyser. Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s.
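
The expected figure follows from the bus occupancy per packet seen on the logic analyser; as a sketch (the 18 µs occupancy used here is an assumed illustrative value, not a quoted measurement):

% Throughput implied by a PCI-X bus occupancy t per L-byte packet:
\[
  \text{Throughput} \;=\; \frac{8L}{t} \;=\; \frac{8 \times 16080\ \text{bit}}{\sim 18\ \mu\text{s}} \;\approx\; 7\ \text{Gbit/s}
\]
% The measured 5.7 Gbit/s at mmrbc = 4096 bytes presumably reflects the CSR access,
% PCI-X sequence and interrupt/CSR-update overheads annotated in the trace.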

[Plots: kernel 2.6.1 #17, HP Itanium, Intel 10GE, Feb 04. PCI-X transfer time (µs) and PCI-X transfer rate (Gbit/s) vs Max Memory Read Byte Count (mmrbc 512, 1024, 2048, 4096 bytes); curves for measured transfer time, expected time, rate from expected time and max PCI-X throughput, with 5.7 Gbit/s reached at mmrbc 4096. The PCI-X sequence is annotated as CSR access, PCI-X sequence, data transfer, interrupt & CSR update.]

Page 19: Transport Benchmarking   Panel Discussion


More Information: Some URLs 1
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
“Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special Issue 2004, http://www.hep.man.ac.uk/~rich/
TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Page 20: Transport Benchmarking   Panel Discussion


More Information: Some URLs 2
Lectures, tutorials etc. on TCP/IP:
www.nv.cc.va.us/home/joney/tcp_ip.htm
www.cs.pdx.edu/~jrb/tcpip.lectures.html
www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
www.jbmelectronics.com/tcp.htm
Encyclopaedia: http://www.freesoft.org/CIE/index.htm
TCP/IP resources: www.private.org.il/tcpip_rl.html
Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Page 21: Transport Benchmarking   Panel Discussion


High Throughput Demonstrations: Manchester rtt 6.2 ms; (Geneva) rtt 128 ms

[Diagram: man03 to lon01 test path. Dual Xeon 2.2 GHz hosts with 1 GEth links to Cisco 7609 routers in Manchester and London (Chicago), connected through Cisco GSRs across the 2.5 Gbit SDH MB-NG core.]

Send data with TCP; drop packets

Monitor TCP with Web100

Page 22: Transport Benchmarking   Panel Discussion


High Performance TCP – MB-NG

Drop 1 in 25,000; rtt 6.2 ms; recover in 1.6 s

Standard, HighSpeed, Scalable
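
The 1.6 s recovery is consistent with the same additive-increase estimate used for the DataTAG slide earlier, assuming 1 Gbit/s and 1460-byte segments:

\[
  \frac{W}{2} \times \mathrm{RTT} \;=\; \frac{1}{2} \times \frac{10^{9} \times 0.0062}{1460 \times 8} \times 0.0062\ \text{s} \;\approx\; 1.6\ \text{s}
\]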

Page 23: Transport Benchmarking   Panel Discussion


FAST TCP vs newReno

[Plot annotations: traffic flow Channel #1: newReno; traffic flow Channel #2: FAST; utilization 70%; utilization 90%.]

Page 24: Transport Benchmarking   Panel Discussion


FAST

As well as packet loss, FAST uses RTT to detect congestion. The RTT is very stable: σ(RTT) ~ 9 ms vs 37 ± 0.14 ms for the others.

SLAC-CERN

Big drops in throughput which take several seconds to recover from

2nd flow never gets equal share of bandwidth
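
A minimal sketch in Python of the delay-based signal that FAST-style stacks add to loss: compare smoothed RTT samples with the observed base RTT to estimate queueing. The smoothing constant, threshold and sample values are illustrative assumptions, not the FAST algorithm:

# Sketch: estimating queueing delay from RTT samples (Vegas/FAST-style signal).
def queueing_signal(rtt_samples, alpha=0.1, threshold_s=0.005):
    base_rtt = min(rtt_samples)                   # propagation-delay estimate
    srtt = rtt_samples[0]
    for rtt in rtt_samples[1:]:
        srtt = (1 - alpha) * srtt + alpha * rtt   # EWMA of the RTT
        queue_delay = srtt - base_rtt
        yield queue_delay, queue_delay > threshold_s

# Example: RTT rises from ~180 ms as a queue builds, then drains again.
samples = [0.180, 0.181, 0.184, 0.190, 0.205, 0.230, 0.200, 0.183]
for delay, congested in queueing_signal(samples):
    print(f"queueing ~ {delay * 1000:5.1f} ms  congested={congested}")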