
Page 1: TCP transfers over high latency/bandwidth networks

Internet2 Member Meeting, HENP working group session
April 9-11, 2003, Arlington

T. Kelly, University of Cambridge
J.P. Martin-Flatin, O. Martin, CERN
S. Low, Caltech
L. Cottrell, SLAC
S. Ravot, Caltech
[email protected]

Page 2: Context

High Energy Physics (HEP)
- The LHC model shows that data at the experiment will be stored at a rate of 100-1500 MBytes/sec throughout the year.
- Many Petabytes per year of stored and processed binary data will be accessed and processed repeatedly by the worldwide collaborations.

New backbone capacities advancing rapidly to the 10 Gbps range

TCP limitation
- Additive increase and multiplicative decrease policy

TCP Fairness
- Effect of the MTU
- Effect of the RTT

New TCP implementations
- Grid DT
- Scalable TCP
- Fast TCP
- High-speed TCP

Internet2 Land Speed Record

Page 3: Time to recover from a single loss

[Figure: TCP throughput from CERN to Chicago over the 622 Mbit/s link; throughput (Mbit/s) vs. time (s)]

TCP reactivity
- The time to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Chicago and CERN.

A single loss is disastrous
- A TCP connection reduces its bandwidth use by half after a loss is detected (multiplicative decrease).
- A TCP connection increases its bandwidth use slowly (additive increase).
- TCP throughput is much more sensitive to packet loss in WANs than in LANs.

Page 4: Responsiveness (I)

The responsiveness measures how quickly we go back to using the network link at full capacity after experiencing a loss, assuming that the congestion window size is equal to the bandwidth-delay product when the packet is lost:

    responsiveness = C · RTT² / (2 · MSS)

where C is the capacity of the link.

[Figure: TCP responsiveness as a function of RTT (ms), showing recovery time (s) for C = 622 Mbit/s, C = 2.5 Gbit/s and C = 10 Gbit/s]
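The formula above can be checked with a short calculation (a minimal sketch, assuming C in bit/s, MSS in bytes and RTT in seconds, and ignoring delayed ACKs; the function name is mine):

```python
def responsiveness(capacity_bps, rtt_s, mss_bytes):
    """Time for standard TCP to grow cwnd from BDP/2 back to the BDP.

    cwnd gains one MSS per RTT, so recovering C*RTT/(2*MSS*8) segments
    takes that many RTTs: r = C * RTT^2 / (2 * MSS * 8) seconds.
    """
    segments_to_recover = capacity_bps * rtt_s / (2 * mss_bytes * 8)
    return segments_to_recover * rtt_s

# WAN Geneva <-> Chicago: 1 Gb/s, RTT = 120 ms, MSS = 1460 bytes
print(responsiveness(1e9, 0.120, 1460) / 60)    # ~10 minutes
# WAN Geneva <-> Tokyo: 1 Gb/s, RTT = 300 ms
print(responsiveness(1e9, 0.300, 1460) / 60)    # ~64 minutes
```

Both values agree with the corresponding rows of the table on the next slide.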

Page 5: Responsiveness (II)

Case | C | RTT (ms) | MSS (bytes) | Responsiveness
Typical LAN in 1988 | 10 Mb/s | [2 ; 20] | 1460 | [1.7 ms ; 171 ms]
Typical LAN today | 1 Gb/s | 2 (worst case) | 1460 | 96 ms
Future LAN | 10 Gb/s | 2 (worst case) | 1460 | 1.7 s
WAN Geneva <-> Chicago | 1 Gb/s | 120 | 1460 | 10 min
WAN Geneva <-> Sunnyvale | 1 Gb/s | 180 | 1460 | 23 min
WAN Geneva <-> Tokyo | 1 Gb/s | 300 | 1460 | 1 h 04 min
WAN Geneva <-> Sunnyvale | 2.5 Gb/s | 180 | 1460 | 58 min
Future WAN CERN <-> Starlight | 10 Gb/s | 120 | 1460 | 1 h 32 min
Future WAN link CERN <-> Starlight | 10 Gb/s | 120 | 8960 (Jumbo Frame) | 15 min

The Linux kernel 2.4.x implements delayed acknowledgments, which double the responsiveness: the values above therefore have to be multiplied by two!

Page 6: Effect of the MTU on the responsiveness

[Figure: throughput (Mb/s) vs. time (s) for MTU = 1498, MTU = 3998 and MTU = 8988 bytes]

Effect of the MTU on a transfer between CERN and Starlight (RTT = 117 ms, bandwidth = 1 Gb/s):
- A larger MTU improves TCP responsiveness because cwnd increases by one MSS each RTT.
- Wire speed could not be reached with the standard MTU.
- A larger MTU reduces the per-frame overhead (saves CPU cycles, reduces the number of packets).

Page 7: MTU and Fairness

Two TCP streams, CERN (GVA) to Starlight (Chi), share a 1 Gb/s bottleneck, RTT = 117 ms:
- MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
- MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
- Link utilization: 70.7%

[Testbed diagram: Host #1 and Host #2 on each side, 1 GE host links, GbE switch and routers, POS 2.5 Gbps WAN link; the 1 GE segment is the bottleneck]

[Figure: throughput (Mbps) vs. time (s) of the two streams with different MTU sizes sharing the 1 Gbps bottleneck, with averages over the life of each connection]

Page 8: RTT and Fairness

Two TCP streams share a 1 Gb/s bottleneck, MTU = 9000 bytes:
- CERN <-> Sunnyvale, RTT = 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s
- CERN <-> Starlight, RTT = 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s
- Link utilization: 71.6%

[Testbed diagram: hosts at CERN (GVA), Starlight (Chi) and Sunnyvale; 1 GE and 10 GE host links, GbE switch and routers, POS 2.5 Gb/s and POS 10 Gb/s WAN links; the 1 GE segment is the bottleneck]

[Figure: throughput (Mbps) vs. time (s) of the two streams with different RTT (181 ms and 117 ms) sharing the 1 Gbps bottleneck, with averages over the life of each connection]

Page 9: Effect of buffering on End-hosts

[Figures: RTT (ms) vs. time (s) and throughput (Mb/s) vs. time (s) of a TCP connection between CERN (GVA) and Starlight (Chi), annotated with Area #1 and Area #2]

Setup
- RTT = 117 ms
- Jumbo frames
- Transmit queue of the network device = 100 packets (i.e. 900 kBytes)

Area #1: cwnd < BDP
- Throughput < bandwidth
- RTT constant
- Throughput = cwnd / RTT

Area #2: cwnd > BDP
- Throughput = bandwidth
- RTT increases (proportionally to cwnd)

Link utilization larger than 75%. A toy model of these two areas is sketched below.

[Testbed diagram: host at GVA and host at CHI, 1 GE host links, routers, POS 2.5 Gb/s WAN link]
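A toy model of the two regimes above (purely illustrative, not the measurement setup; cwnd is given in bytes and the queueing model is the simplest possible one):

```python
def steady_state(cwnd_bytes, bandwidth_bps, base_rtt_s):
    """Throughput and RTT as a function of cwnd relative to the BDP.

    Area #1 (cwnd < BDP): the pipe is not full, the RTT stays at the base
    RTT and throughput = cwnd / RTT.
    Area #2 (cwnd > BDP): the pipe is full, throughput saturates at the
    bandwidth and the excess (cwnd - BDP) sits in buffers, inflating the RTT.
    """
    bdp_bytes = bandwidth_bps * base_rtt_s / 8
    if cwnd_bytes <= bdp_bytes:                   # Area #1
        rtt_s = base_rtt_s
        throughput_bps = cwnd_bytes * 8 / rtt_s
    else:                                         # Area #2
        throughput_bps = bandwidth_bps
        rtt_s = cwnd_bytes * 8 / bandwidth_bps    # RTT grows in proportion to cwnd
    return throughput_bps, rtt_s

print(steady_state(5_000_000, 1e9, 0.117))    # Area #1: ~342 Mb/s, RTT = 117 ms
print(steady_state(20_000_000, 1e9, 0.117))   # Area #2: 1 Gb/s, RTT = 160 ms
```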

Page 10: Buffering space on End-hosts

Link utilization is near 100% if:
- there is no congestion in the network,
- there are no transmission errors,
- buffering space = bandwidth-delay product,
- TCP buffer size = 2 × bandwidth-delay product, so that the congestion window size is always larger than the bandwidth-delay product (see the sketch below).

[Figures: effect of the buffering on the throughput (Mb/s) and on the RTT (ms) vs. time (s), for txqueuelen = 100, 500, 1000 and 1500 packets]

txqueuelen is the length of the transmit queue of the network device.
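A minimal sketch of the sizing rule above (buffering space = one BDP, TCP socket buffers = 2 × BDP); expressing the transmit queue in MTU-sized packets is my assumption, not something stated on the slide:

```python
def buffer_sizing(bandwidth_bps, rtt_s, mtu_bytes=9000):
    """End-host buffering for a loss-free path, following the rule above:
    device transmit queue ~ one BDP, TCP socket buffers = 2 * BDP."""
    bdp_bytes = bandwidth_bps * rtt_s / 8
    return {
        "bdp_bytes": int(bdp_bytes),
        "tcp_buffer_bytes": int(2 * bdp_bytes),            # cwnd can then always exceed the BDP
        "txqueuelen_packets": int(bdp_bytes / mtu_bytes),   # assumption: queue counted in MTU-sized packets
    }

# CERN <-> Starlight: 1 Gb/s, RTT = 117 ms, jumbo frames
print(buffer_sizing(1e9, 0.117))
# -> BDP ~ 14.6 MB, TCP buffers ~ 29.3 MB, txqueuelen ~ 1625 packets
```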

Page 11: Linux Patch “GRID DT”

Parameter tuning
- New parameter to better start a TCP transfer: set the value of the initial SSTHRESH.

Modifications of the TCP algorithms (RFC 2001)
- Modification of the well-known congestion avoidance algorithm: during congestion avoidance, for every acknowledgement received, cwnd increases by A * (segment size) * (segment size) / cwnd. This is equivalent to increasing cwnd by A segments each RTT; A is called the additive increment.
- Modification of the slow start algorithm: during slow start, for every acknowledgement received, cwnd increases by M segments; M is called the multiplicative increment.
- Note: A = 1 and M = 1 in TCP Reno.

Smaller backoff
- Reduces the strong penalty imposed by a loss.

(A sketch of these updates follows below.)
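The sketch below is a simplified model of the Grid DT behaviour described above, not the actual kernel patch; cwnd is counted in segments:

```python
def on_ack(cwnd, ssthresh, A=1, M=1):
    """Grid DT congestion window update on each acknowledgement.

    Slow start: cwnd grows by M segments per ACK (multiplicative increment M).
    Congestion avoidance: cwnd grows by A/cwnd per ACK, i.e. by A segments
    per RTT (additive increment A). TCP Reno corresponds to A = 1, M = 1.
    """
    if cwnd < ssthresh:
        return cwnd + M           # slow start
    return cwnd + A / cwnd        # congestion avoidance

def on_loss(cwnd, backoff=0.5):
    """Multiplicative decrease; Grid DT allows a smaller backoff than Reno's 1/2."""
    return max(1.0, cwnd * (1 - backoff))
```

With A = 7, for example, the window opens seven segments per RTT, which is what the RTT-fairness experiment later in the talk relies on.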

Page 12: Grid DT

- Only the sender's TCP stack has to be modified.
- Very simple modifications to the TCP/IP stack.
- An alternative to multi-stream TCP transfers: compared to multiple streams, a single stream is simpler, startup/shutdown are faster, and there are fewer keys to manage (if it is secure).
- Virtual increase of the MTU: compensates for the effect of delayed ACKs.
- Can improve “fairness” between flows with different RTT and between flows with different MTU.

Page 13: Effect of the RTT on the fairness

Objective: improve fairness between two TCP streams with different RTTs and the same MTU. We can adapt the model proposed by Matt Mathis by taking into account a higher additive increment.

Assumptions:
- Approximate a packet loss probability p by assuming that each flow delivers 1/p consecutive packets followed by one drop.
- Under these assumptions, the congestion windows of the flows oscillate with a period T0.
- If the receiver acknowledges every packet, the congestion window opens by x (the additive increment) packets each RTT.

CWND evolution under periodic loss: each congestion window oscillates between W/2 and W with period T0.

The number of packets delivered by each stream in one period is obtained by integrating its congestion window W_A(t), respectively W_B(t'), over the period (A and B are the additive increments of the two streams). Relating the time variables t and t' of the two streams through their RTTs and equating the two packet counts gives the condition for equal shares:

    A / B = (RTT_A / RTT_B)²

By modifying the congestion increment dynamically according to the RTT, fairness among TCP connections can be guaranteed.
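The resulting tuning rule in a few lines (the function name is mine; the example RTTs and the A = 7, B = 3 choice appear on the next slide):

```python
def increment_ratio(rtt_a_s, rtt_b_s):
    """Additive increments giving equal shares to two flows with different RTTs:
    A / B = (RTT_A / RTT_B)^2."""
    return (rtt_a_s / rtt_b_s) ** 2

# GVA <-> Sunnyvale (181 ms) vs. GVA <-> Chicago (117 ms)
print(increment_ratio(0.181, 0.117))   # ~2.39; the experiment uses A = 7, B = 3 (7/3 ~ 2.33)
```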

Page 14: Effect of the RTT on the fairness

[Figure: throughput (Mbps) vs. time (s) of two streams with different RTT sharing a 1 Gbps bottleneck, with A = 7 for RTT = 181 ms and B = 3 for RTT = 117 ms, and averages over the life of each connection]

TCP Reno performance (see slide 8):
- First stream GVA <-> Sunnyvale: RTT = 181 ms; avg. throughput over a period of 7000 s = 202 Mb/s
- Second stream GVA <-> CHI: RTT = 117 ms; avg. throughput over a period of 7000 s = 514 Mb/s
- Link utilization: 71.6%

Grid DT tuning in order to improve fairness between two TCP streams with different RTT:
- First stream GVA <-> Sunnyvale: RTT = 181 ms, additive increment A = 7; average throughput = 330 Mb/s
- Second stream GVA <-> CHI: RTT = 117 ms, additive increment B = 3; average throughput = 388 Mb/s
- Link utilization: 71.8%

(RTT_A / RTT_B)² = (181 / 117)² ≈ 2.39, while A / B = 7 / 3 ≈ 2.33.

[Testbed diagram: hosts at CERN (GVA), Starlight (CHI) and Sunnyvale; 1 GE and 10 GE host links, GbE switch and routers, POS 2.5 Gb/s and POS 10 Gb/s WAN links; the 1 GE segment is the bottleneck]

Page 15: Effect of the MTU

[Figure: throughput (Mbps) vs. time (s) of two streams with different MTU sizes sharing a 1 Gbps bottleneck, with averages over the life of each connection for MTU = 3000 and MTU = 9000 bytes]

Two TCP streams, CERN (GVA) to Starlight (Chi), share a 1 Gb/s bottleneck, RTT = 117 ms:
- MTU = 3000 bytes, additive increment = 3: avg. throughput over a period of 6000 s = 310 Mb/s
- MTU = 9000 bytes, additive increment = 1: avg. throughput over a period of 6000 s = 325 Mb/s
- Link utilization: 61.5%

[Testbed diagram: Host #1 and Host #2 on each side, 1 GE host links, GbE switch and routers, POS 2.5 Gb/s WAN link; the 1 GE segment is the bottleneck]

Page 16: Next Work

Taking into account the value of the MTU in the evaluation of the additive increment. Define a reference, for example MTU_REF = 9000 bytes => additive increment = 1; then MTU = 1500 bytes => additive increment = 6, and MTU = 3000 bytes => additive increment = 3:

    A(MTU) = 1               if MTU >= MTU_REF
    A(MTU) = MTU_REF / MTU   if MTU < MTU_REF

Taking into account the square of the RTT in the evaluation of the additive increment. Define a reference, for example RTT_REF = 10 ms => additive increment = 1; then RTT = 100 ms => additive increment = 100, and RTT = 200 ms => additive increment = 400:

    A(RTT) = 1                    if RTT <= RTT_REF
    A(RTT) = (RTT / RTT_REF)²     if RTT > RTT_REF

Combining the two formulas above: additive increment = f(RTT², MTU) (see the sketch below).

Open questions: periodic evaluation of the RTT and the MTU; how to define the references?
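A sketch of the two rules above; combining them as a product is my reading of "combining the two formulas above", since the slide only states that the additive increment is a function of RTT² and the MTU:

```python
def a_rtt(rtt_ms, rtt_ref_ms=10):
    """RTT term: 1 at or below the reference RTT, (RTT / RTT_REF)^2 above it."""
    return 1 if rtt_ms <= rtt_ref_ms else (rtt_ms / rtt_ref_ms) ** 2

def a_mtu(mtu_bytes, mtu_ref_bytes=9000):
    """MTU term: 1 at or above the reference MTU, MTU_REF / MTU below it."""
    return 1 if mtu_bytes >= mtu_ref_bytes else mtu_ref_bytes / mtu_bytes

def additive_increment(rtt_ms, mtu_bytes):
    # Assumption: combine the two terms multiplicatively.
    return a_rtt(rtt_ms) * a_mtu(mtu_bytes)

print(a_rtt(100), a_rtt(200))     # 100.0, 400.0  (the slide's RTT examples)
print(a_mtu(1500), a_mtu(3000))   # 6.0, 3.0      (the slide's MTU examples)
```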

Page 17: Scalable TCP

For cwnd > lwnd, replace AIMD with a new algorithm:
- for each ACK in an RTT without loss: cwnd_{i+1} = cwnd_i + a
- for each window experiencing loss: cwnd_{i+1} = cwnd_i - (b × cwnd_i)

Kelly's proposal during his internship at CERN: (lwnd, a, b) = (16, 0.01, 0.125), a trade-off between fairness, stability, variance and convergence. A sketch of this update follows below.

Advantages:
- Responsiveness improves dramatically for gigabit networks
- Responsiveness is independent of capacity
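A compact sketch of this update rule with Kelly's constants (a simplified model, not the actual kernel patch; cwnd in segments):

```python
LWND, A, B = 16, 0.01, 0.125   # Kelly's proposal: (lwnd, a, b)

def scalable_on_ack(cwnd):
    """Per-ACK increase: legacy congestion avoidance below lwnd,
    constant per-ACK increase (i.e. multiplicative per-RTT growth) above it."""
    if cwnd < LWND:
        return cwnd + 1.0 / cwnd    # standard AIMD region
    return cwnd + A                 # Scalable TCP: cwnd grows by a per ACK

def scalable_on_loss(cwnd):
    """On a loss event, shrink the window by the fixed fraction b."""
    return max(1.0, cwnd * (1 - B))
```

Because each RTT multiplies cwnd by roughly (1 + a), the time to recover from a loss depends only on a, b and the RTT, not on the link capacity.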

Page 18: Scalable TCP: Responsiveness Independent of Capacity

Page 19: Scalable TCP vs. TCP NewReno: Benchmarking

Bulk throughput tests with C = 2.5 Gbit/s; flows transfer 2 GBytes and start again, for 1200 s:

Number of flows | 2.4.19 TCP | 2.4.19 TCP + new dev driver | Scalable TCP
1 | 7 | 16 | 44
2 | 14 | 39 | 93
4 | 27 | 60 | 135
8 | 47 | 86 | 140
16 | 66 | 106 | 142

Responsiveness for RTT = 200 ms and MSS = 1460 bytes (checked in the sketch below):
- Scalable TCP: 2.7 s
- TCP NewReno (AIMD): ~3 min at 100 Mbit/s, ~1 h 10 min at 2.5 Gbit/s, ~4 h 45 min at 10 Gbit/s

For details, see the paper and code at http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
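The responsiveness figures above can be reproduced with a short calculation (a sketch using the earlier recovery formula for NewReno and exponential regrowth for Scalable TCP; delayed ACKs are ignored):

```python
import math

MSS, RTT = 1460, 0.200   # bytes, seconds

def newreno_recovery_s(capacity_bps):
    """Linear recovery: C*RTT/(2*MSS*8) segments regained at one segment per RTT."""
    return capacity_bps * RTT ** 2 / (2 * MSS * 8)

def scalable_recovery_s(a=0.01, b=0.125):
    """Exponential recovery: cwnd regrows from (1-b)*W to W, gaining a factor
    of roughly (1+a) per RTT."""
    return RTT * math.log(1 / (1 - b)) / math.log(1 + a)

print(newreno_recovery_s(100e6) / 60)     # ~2.9 min at 100 Mbit/s
print(newreno_recovery_s(2.5e9) / 3600)   # ~1.2 h at 2.5 Gbit/s
print(newreno_recovery_s(10e9) / 3600)    # ~4.8 h at 10 Gbit/s
print(scalable_recovery_s())              # ~2.7 s, independent of capacity
```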

Page 20: Fast TCP

Equilibrium properties
- Uses end-to-end delay and loss
- Achieves any desired fairness, expressed by a utility function
- Very high utilization (99% in theory)

Stability properties
- Stability for arbitrary delay, capacity, routing & load
- Robust to heterogeneity, evolution, …
- Good performance: negligible queueing delay & loss (with ECN), fast response

Page 21: FAST TCP performance

Page 22: FAST TCP performance

FAST, standard MTU, utilization averaged over > 1 hr:

Flows | 1 | 2 | 7 | 9 | 10
Average utilization | 95% | 92% | 90% | 90% | 88%
Averaging period | 1 hr | 1 hr | 6 hr | 1.1 hr | 6 hr

Page 23: FAST TCP performance

FAST vs. Linux TCP, standard MTU, utilization averaged over 1 hr:

Configuration | Linux TCP (txq=100) | Linux TCP (txq=10000) | FAST
2G | 19% | 27% | 92%
1G | 16% | 48% | 95%

Page 24: Internet2 Land Speed Record

On February 27-28, 2003, over a Terabyte of data was transferred in less than an hour between the Level(3) Gateway in Sunnyvale, near SLAC, and CERN.

The data passed through the TeraGrid router at StarLight, from memory to memory, as a single TCP/IP stream at an average rate of 2.38 Gbit/s (using large windows and 9 KByte "jumbo frames").

This beat the former record by a factor of approximately 2.5 and used the US-CERN link at 99% efficiency.

Page 25: Internet2 LSR testbed

Page 26: Conclusion

To achieve high throughput over high latency/bandwidth networks, we need to:
- Set the initial slow start threshold (ssthresh) to a value appropriate for the delay and bandwidth of the link.
- Avoid loss by limiting the maximum cwnd size.
- Recover fast if a loss occurs: larger cwnd increment, smaller window reduction after a loss, larger packet size (Jumbo Frame).

Is the standard MTU the largest bottleneck?

How to define fairness?
- Taking into account the MTU
- Taking into account the RTT

Which is the best new TCP implementation?