nordunet 2003, reykjavik, iceland, 26 august 2003 high-performance transport protocols for...

37
NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge, UK S. Ravot, Caltech, USA J.P. Martin-Flatin, CERN, Switzerland

Upload: earl-mcdowell

Post on 01-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

High-Performance Transport Protocols for Data-Intensive

World-Wide Grids

T. Kelly, University of Cambridge, UKS. Ravot, Caltech, USA

J.P. Martin-Flatin, CERN, Switzerland

Page 2: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

2NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Outline

Overview of DataTAG project Problems with TCP in data-intensive Grids

Problem statement Analysis and characterization

Solutions: Scalable TCP GridDT

Future Work

Page 3: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

3NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Overview ofDataTAG Project

Page 4: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

4NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Member Organizations

http://www.datatag.org/

Page 5: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

5NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Project Objectives

Build a testbed to experiment with massive file transfers (TBytes) across the Atlantic

Provide high-performance protocols for gigabit networks underlying data-intensive Grids

Guarantee interoperability between major HEP Grid projects in Europe and the USA

Page 6: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

6NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

DataTAG Testbed

r04gvaCisco7606

r04chiCisco7609 stm16

(DTag)

r05chi-JuniperM10

r06chi-Alcatel7770

r05gva-JuniperM10

r06gvaAlcatel7770

SURFNET

stm16(Colt)backup+projects

s01gvaExtreme S1i

w01gvaw02gvaw03gvaw04gvaw05gvaw06gvaw20gvav02gvav03gva

7x

w01chiw02chi

v10chiv11chiv12chiv13chi

s01chiExtreme S5i

8x

VTHD/INRIA

stm16(FranceTelecom)

Chicago Geneva

2x

ONS15454

Alcatel 1670 Alcatel 1670

SURFNETCESNET

ONS15454

stm64(GC)

CNAFGEANTcernh7

w03chiw04chiw05chi

3x 3x 2x

w02chi

1000baseSX

SDH/Sonet

1000baseT

10GbaseLX

w03

w06chi

w01bol

CCC tunnel

stm64

Edoardo Martelli

Page 7: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

7NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Network Research Activities

Enhance performance of network protocols for massive file transfers: Data-transport layer: TCP, UDP, SCTP

QoS: LBE (Scavenger) Equivalent DiffServ (EDS)

Bandwidth reservation: AAA-based bandwidth on demand Lightpaths managed as Grid resources

Monitoring

Rest of this talkRest of this talk

Page 8: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

8NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Problems with TCP inData-Intensive Grids

Page 9: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

9NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Problem Statement

End-user’s perspective: Using TCP as the data-transport protocol for Grids

leads to a poor bandwidth utilization in fast WANs

Network protocol designer’s perspective: TCP is inefficient in high bandwidth*delay

networks because: few TCP implementations have been tuned for gigabit

WANs TCP was not designed with gigabit WANs in mind

Page 10: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

10NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Design Problems (1/2)

TCP’s congestion control algorithm (AIMD) is not suited to gigabit networks

Due to TCP’s limited feedback mechanisms, line errors are interpreted as congestion: Bandwidth utilization is reduced when it shouldn’t

RFC 2581 (which gives the formula for increasing cwnd) “forgot” delayed ACKs: Loss recovery time twice as long as it should be

Page 11: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

11NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Design Problems (2/2)

TCP requires that ACKs be sent at most every second segment: Causes ACK bursts Bursts are difficult to handle by kernel and NIC

Page 12: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

12NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

AIMD (1/2)

Van Jacobson, SIGCOMM 1988 Congestion avoidance algorithm:

For each ACK in an RTT without loss, increase:

For each window experiencing loss, decrease:

Slow-start algorithm: Increase by one MSS per ACK until ssthresh

iii cwnd

cwndcwnd1

1

ii cwndcwnd 2

11

Page 13: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

13NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

AIMD (2/2)

Additive Increase: A TCP connection increases slowly its bandwidth

utilization in the absence of loss: forever, unless we run out of send/receive buffers or

detect a packet loss TCP is greedy: no attempt to reach a stationary state

Multiplicative Decrease: A TCP connection reduces its bandwidth utilization

drastically whenever a packet loss is detected: assumption: line errors are negligible, hence packet loss

means congestion

Page 14: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

14NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Congestion Window (cwnd)

buffering beyondbuffering beyondbandwidth*delaybandwidth*delay

congestioncongestionavoidanceavoidance

slowslowstartstart

lossloss

Page 15: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

15NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Disastrous Effect of Packet Losson TCP in Fast WANs (1/2)

AIMD C=1 Gbit/s MSS=1,460 Bytes

Page 16: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

16NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Disastrous Effect of Packet Losson TCP in Fast WANs (2/2)

Long time to recover from a single loss: TCP should react to congestion rather than packet

loss: line errors and transient faults in equipment are no

longer negligible in fast WANs TCP should recover quicker from a loss

TCP is particularly sensitive to packet loss in fast WANs (i.e., when both cwnd and RTT are large)

Page 17: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

17NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Characterization of the Problem (1/2)

The responsiveness measures how quickly we go back to using the network link at full capacity after experiencing a loss (i.e., loss recovery time if loss occurs when bandwidth utilization = network link capacity)

2 . inc2 . inc

TCP responsiveness

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

0 50 100 150 200

RTT (ms)

Tim

e (

s) C= 622 Mbit/s

C= 2.5 Gbit/s

C= 10 Gbit/s

C . RTTC . RTT22

Page 18: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

18NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Characterization of the Problem (2/2)

Capacity RTT # inc Responsiveness

9.6 kbit/s(typ. WAN in 1988)

max: 40 ms 1 0.6 ms

10 Mbit/s(typ. LAN in 1988)

max: 20 ms 8 ~150 ms

100 Mbit/s(typ. LAN in 2003)

max: 5 ms 20 ~100 ms

622 Mbit/s 120 ms ~2,900 ~6 min

2.5 Gbit/s 120 ms ~11,600 ~23 min

10 Gbit/s 120 ms ~46,200 ~1h 30min

inc size = MSS = 1,460 Bytes

Page 19: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

19NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Congestion vs. Line Errors

ThroughputRequired Bit

Loss RateRequired Packet

Loss Rate

10 Mbit/s 2 10-8 2 10-4

100 Mbit/s 2 10-10 2 10-6

2.5 Gbit/s 3 10-13 3 10-9

10 Gbit/s 2 10-14 2 10-10

At gigabit speed, the loss rate required for packet loss to be ascribed only to congestion is unrealistic with AIMD

RTT=120 ms, MTU=1,500 Bytes, AIMD

Page 20: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

20NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Single TCP Stream Performance under Periodic Losses

Effect of packet loss

0102030405060708090

100

0.000001 0.00001 0.0001 0.001 0.01 0.1 1 10Packet Loss frequency (%)

Ba

nd

wid

th U

tiliz

ati

on

(%

)

WAN (RTT=120ms)

LAN (RTT=0.04 ms)

Effect of packet loss

0100200300400500600700800900

1000

0.000001 0.00001 0.0001 0.001 0.01 0.1 1 10

Packet Loss frequency (%)

Th

rou

gh

pu

t (M

bit

/s)

Mathis SIGCOMM'97

WAN (RTT=120ms)

LAN (RTT=0.04 ms)

Loss rate =0.01%:Loss rate =0.01%:LAN BW LAN BW utilization= 99%utilization= 99%WAN BW WAN BW utilization=1.2%utilization=1.2%

MSS=1,460 Bytes

Page 21: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

21NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Solutions

Page 22: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

22NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

What Can We Do?

To achieve higher throughputs over high bandwidth*delay networks, we can: Change AIMD algorithm Use larger MTUs Change the initial setting of ssthresh Avoid losses in end hosts

Two proposals: Kelly: Scalable TCP Ravot: GridDT

Page 23: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

23NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Delayed ACKs with AIMD

RFC 2581 (spec. defining TCP congestion control AIMD algorithm) erred:

Implicit assumption: one ACK per packet In reality: one ACK every second packet with

delayed ACKs Responsiveness multiplied by two:

Makes a bad situation worse in fast WANs Problem fixed by ABC in RFC 3465 (Feb 2003)

Not implemented in Linux 2.4.21

iii cwnd

SMSSSMSScwndcwnd

1

Page 24: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

24NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Delayed ACKs with AIMD and ABC

Page 25: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

25NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Scalable TCP: Algorithm

For cwnd>lwnd, replace AIMD with new algorithm:

for each ACK in an RTT without loss: cwndi+1 = cwndi + a

for each window experiencing loss: cwndi+1 = cwndi – (b x cwndi)

Kelly’s proposal during internship at CERN:(lwnd,a,b) = (16, 0.01, 0.125)

Trade-off between fairness, stability, variance and convergence

Page 26: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

26NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Scalable TCP: lwnd

Page 27: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

27NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Scalable TCP: Advantages

Responsiveness is independent of capacity Responsiveness improves dramatically for

gigabit networks

Page 28: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

28NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Scalable TCP: Responsiveness Independent of Capacity

Page 29: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

29NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Scalable TCP:Improved Responsiveness

Responsiveness for RTT=200 ms and MSS=1,460 Bytes: Scalable TCP: ~3 s AIMD:

~3 min at 100 Mbit/s ~1h 10min at 2.5 Gbit/s ~4h 45min at 10 Gbit/s

Patch available for Linux kernel 2.4.19: http://www-lce.eng.cam.ac.uk/˜ctk21/scalable/

Page 30: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

30NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Scalable TCP vs. AIMD:Benchmarking

Number of flows

2.4.19 TCP2.4.19 TCP + new dev

driver

Scalable TCP

1 7 16 44

2 14 39 93

4 27 60 135

8 47 86 140

16 66 106 142

Bulk throughput tests with C=2.5 Gbit/s. Flows transfer 2 GBytes and start again for 20 min.

Page 31: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

31NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

GridDT: Algorithm

Congestion avoidance algorithm: For each ACK in an RTT without loss,

increase:

By modifying A dynamically according to RTT, GridDT guarantees fairness among TCP connections:

iii cwnd

Acwndcwnd 1

2

2

1

2

1

A

A

RTT

RTT

A

A

Page 32: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

32NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

AIMD: RTT Bias

SunnyvaleSunnyvaleStarLightStarLight

CERNCERN

RR RRGE GE SwitchSwitch

Host #1Host #1

POS 2.5POS 2.5 Gbit/sGbit/s1 GE1 GE

1 GE1 GE

Host #2Host #2

Host #1Host #1

Host #2Host #2

1 GE1 GE

1 GE1 GE

BottleneckBottleneck

RRPOS 10POS 10 Gbit/sGbit/sRR10GE10GE

Two TCP streams share a 1 Gbit/s bottleneck CERN-Sunnyvale: RTT=181ms. Avg. throughput over a period of 7,000s = 202 Mbit/s CERN-StarLight: RTT=117ms. Avg. throughput over a period of 7,000s = 514 Mbit/s MTU = 9,000 Bytes. Link utilization = 72%

Throughput of two streams with different RTT sharing a 1Gbps bottleneck

0

100

200

300

400

500

600

700

800

900

1000

0 1000 2000 3000 4000 5000 6000 7000

Time (s)

Thr

ough

put

(Mbp

s) RTT=181ms

Average over the life of the connection RTT=181ms

RTT=117ms

Average over the life of the connection RTT=117ms

Page 33: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

33NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Throughput of two streams with different RTT sharing a 1Gbps bottleneck

0

100

200

300

400

500

600

700

800

900

1000

0 1000 2000 3000 4000 5000 6000

Time (s)

Thr

ough

put (

Mbp

s)

A=7 ; RTT=181ms

Average over the life of the connection RTT=181ms

B=3 ; RTT=117ms

Average over the life of the connection RTT=117ms

GridDT Fairer than AIMD

CERN-Sunnyvale: RTT = 181 ms. Additive inc. A1 = 7. Avg. throughput = 330 Mbit/s CERN-StarLight: RTT = 117 ms. Additive inc. A2 = 3. Avg. throughput = 388 Mbit/s MTU = 9,000 Bytes. Link utilization 72%

39.2117

18122

2

1

A

A

RTT

RTT

33.23

7

2

1

A

A

SunnyvaleSunnyvale

StarLightStarLight

CERNCERN

RR RRGE GE SwitchSwitch

POS 2.5POS 2.5 Gbit/sGbit/s1 GE1 GE

1 GE1 GE

Host #2Host #2

Host #1Host #1

Host #2Host #2

1 GE1 GE

1 GE1 GEBottleneckBottleneck

RRPOS 10POS 10 Gbit/sGbit/sRR10GE10GE

Host #1Host #1

A2=3 RTT=117ms

A1=7 RTT=181ms

Page 34: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

34NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Measurements with Different MTUs (1/2)

Mathis advocates the use of large MTUs: we tested standard Ethernet MTU and Jumbo frames

Experimental environment: Linux 2.4.21 SysKonnect device driver 6.12 Traffic generated by iperf

average throughout over the last 5 seconds Single TCP stream RTT = 119 ms Duration of each test: 2 hours Transfers from Chicago to Geneva

MTUs: POS MTU set to 9180 Max MTU on the NIC of a PC running Linux 2.4.21: 9000

Page 35: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

35NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Measurements with Different MTUs (2/2)

TCP max: 990 Mbit/s (MTU=9000)TCP max: 990 Mbit/s (MTU=9000)TCP max: 940 Mbit/s (MTU=1500)TCP max: 940 Mbit/s (MTU=1500)

Page 36: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

36NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Related Work

Floyd: High-Speed TCP Low: Fast TCP Katabi: XCP Web100 and Net100 projects PFLDnet 2003 workshop:

http://www.datatag.org/pfldnet2003/

Page 37: NORDUnet 2003, Reykjavik, Iceland, 26 August 2003 High-Performance Transport Protocols for Data-Intensive World-Wide Grids T. Kelly, University of Cambridge,

37NORDUnet 2003, Reykjavik, Iceland, 26 August 2003

Research Directions

Compare performance of TCP variants Investigate proposal by Shorten, Leith, Foy

and Kildu More stringent definition of congestion:

Lose more than 1 packet per RTT ACK more than two packets in one go:

Decrease ACK bursts SCTP vs. TCP