
This paper is included in the Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15), May 4–6, 2015, Oakland, CA, USA. ISBN 978-1-931971-218. Open Access to the Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15) is sponsored by USENIX.

Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter

Glenn Judd, Morgan Stanley

https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd


Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter

Glenn Judd, Morgan Stanley

Abstract

Over the last several years, datacenter computing has become a pervasive part of the computing landscape. In spite of the success of the datacenter computing paradigm, there are significant challenges remaining to be solved—particularly in the area of networking. The success of TCP/IP in the Internet makes TCP/IP a natural candidate for datacenter network communication. A growing body of research and operational experience, however, has found that TCP often performs poorly in datacenter settings. TCP’s poor performance has led some groups to abandon TCP entirely in the datacenter. This is not desirable, however, as it requires construction of a new transport protocol as well as rewriting applications to use the new protocol. Over the last few years, promising research has focused on adapting TCP to operate in the datacenter environment.

We have been running large datacenter computations for several years, and have experienced the promises and the pitfalls that datacenter computation presents. In this paper, we discuss our experiences with network communication performance within our datacenter, and how we have leveraged and extended recent research to significantly improve that performance.

1 Introduction

In recent years, datacenter computing has become a pervasive part of the computing landscape. The most visible examples of datacenter computing are the warehouse-scale computers [4] used to run search engines, social networks, and other publicly visible “cloud” applications. Less visible, but no less critical, are datacenter computing platforms used internally by numerous organizations.

In spite of the success of the datacenter computing paradigm, there are significant challenges remaining to be solved—particularly in the area of networking. The pervasiveness of TCP/IP in the Internet makes TCP/IP a natural candidate for datacenter network communication. TCP/IP, however, was not designed for the datacenter environment, and many TCP design assumptions—e.g. a high degree of flow multiplexing and multi-millisecond RTTs—do not hold in a datacenter. A growing body of research and operational experience has found that TCP can perform poorly in datacenter settings.

TCP’s poor performance has led some groups to abandon TCP entirely [15]. This is not desirable, however, as it requires construction of a new transport protocol as well as rewriting applications to use the new protocol. Recent research has focused on adapting TCP to operate in the datacenter environment. DCTCP stands out as a particularly promising approach as it utilizes technology available today to dramatically improve datacenter TCP performance.

In this paper, we discuss our experiences with network communication performance within our datacenter and discuss how we have leveraged and extended recent research to significantly improve network performance within our datacenter, without requiring changes to our applications.

The experimental results that we present are often in the form of controlled tests that isolate behavior that we encountered either in actual production TCP and DCTCP usage, or in our efforts to introduce DCTCP into production.

In addition, this paper makes the following specific contributions.

• To the best of our knowledge, this paper presents the first published discussion of a DCTCP production deployment.

• We identify shortcomings that make DCTCP as presented and implemented in [1] unusable in our environment, and we present solutions to those shortcomings that we have verified through implementation.

• We demonstrate that commonly used receive buffer tuning algorithms perform poorly in current datacenters.

• We empirically compare DCTCP convergence to TCP convergence, and we show that—surprisingly—DCTCP convergence can be superior to TCP convergence. We show that this is due to DCTCP’s superior coexistence with common receive buffer tuning algorithms. With correct buffer tuning, TCP convergence, stability, and short-term fairness all exceed those of DCTCP.

• We also discuss results from dramatically reducing RTOmin at scale to mitigate incast.


Our discussion will proceed as follows. Section 2 will briefly describe our datacenter environment. Section 3 will discuss the three significant problems that we have encountered with TCP in our datacenter. Section 4 will discuss problems that delayed acknowledgements introduce into datacenter networks, and will analyze solutions. Section 5 will discuss reducing RTOmin to mitigate incast. Section 6 will discuss addressing the root cause of incast-induced packet loss using DCTCP. Section 7 will discuss obstacles that prevent DCTCP from being used in our environment, and solutions for those problems. Section 8 will then compare DCTCP performance to that of TCP. Section 9 will investigate the performance of automatic TCP buffer tuning in our environment. Section 10 will briefly discuss related work, before we conclude in Section 11.

2 Setting

The majority of recent work on TCP in the datacenter has either implicitly or explicitly been undertaken in the context of an Internet services setting. Of course, datacenter computation applies to a much broader spectrum of applications, and even within a single datacenter of a single organization, a wide variety of application types may be found.

Figure 1: Typical Application Structure

2.1 Overview

The context of this work is a datacenter used largely for two broad types of applications: Monte Carlo simulation and data analysis. Figure 1 shows a common communication-intensive application structure in our datacenter. Such an application is constructed as a series of transformations (depicted as rectangles) on data (depicted as ovals and rounded rectangles). Each transformation may read several data elements, and may store several data elements.

Most data access is to one of two highly-parallel distributed data storage systems: a key-value store and a distributed file system. The key-value store tends to generate much higher degrees of incast (discussed at length further in this paper) due to support for bulk reading and writing of values. The distributed file system results in more limited incast as the number of blocks simultaneously read by any particular operation is limited by the file system’s read-ahead limit. Further details of these storage systems are outside the scope of this paper, but both are colocated with our computation servers and—thus—are highly parallel.

Monte Carlo simulations tend to be computationally intensive, but even they tend to contain periods of intensive communication. Data analysis applications tend to be storage and communication intensive.

As our datacenter is shared among many applications and distinct user groups, it is very important that applications in our datacenter are as loosely coupled as possible.

Unless otherwise specified, the applications discussed and results presented in this paper were obtained on a 10 Gbps network with a 9K MTU. Also unless otherwise specified, controlled experiments were conducted using iperf as the traffic generator, sending at the maximum rate allowed. TCP congestion control is CUBIC [6] unless otherwise stated (as CUBIC is the Linux default congestion control). We conducted several of the controlled experiments using the Linux Reno implementation, but did not observe any significant differences. As such, we have left comparisons with Reno (and other TCP variants) out of scope for this work. Applications in this datacenter do not access the public Internet.

2.2 Traffic Characteristics

To illustrate the type of traffic that our applications generate, we recorded network traffic for a two-minute interval of a representative application (a Monte Carlo simulation) on a single server in this application. Due to the uniform nature of both our applications and our storage systems, the traffic seen by other servers is very similar. (We have verified this with additional samples on other servers.) Figures 2, 3, and 4 summarize flow characteristics of the recorded traffic.

TCP connections in our environment tend to be long-lived. For the purposes of this analysis, we define a flow as packets demarcated by TCP PUSH flags within a single TCP connection.
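To make this definition concrete, one plausible reading of it is sketched below as a trace post-processor; the helper, and the choice to treat a PSH-flagged segment as closing a flow, are assumptions for illustration rather than code or tooling from this work.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One captured TCP segment from a single connection. */
    struct seg {
        uint8_t  psh;    /* 1 if the TCP PSH flag was set */
        uint32_t bytes;  /* payload length */
    };

    /* Treat each run of segments ending in a PSH-flagged segment as one
     * flow of the connection, and report that flow's size. */
    void split_into_flows(const struct seg *segs, size_t n)
    {
        uint64_t flow_bytes = 0, flow_id = 0;
        for (size_t i = 0; i < n; i++) {
            flow_bytes += segs[i].bytes;
            if (segs[i].psh) {
                printf("flow %llu: %llu bytes\n",
                       (unsigned long long)flow_id++,
                       (unsigned long long)flow_bytes);
                flow_bytes = 0;
            }
        }
    }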


Figure 2: CDF of flow sizes (cumulative fraction of flows versus flow size, 1 to 10,000,000, log scale)

Figure 3: CDF of flow bytes (cumulative fraction of bytes versus flow size, 1 to 10,000,000, log scale)

Figure 2 depicts the cumulative distribution of flow sizes sampled, and illustrates that the vast majority of flows in this application are very small. These small flows largely consist of either data retrieval requests or simple operation results. (As stated earlier, an individual connection will contain many flows. These flows often occur in quick succession within the connection.)

Figure 3 illustrates the cumulative distribution of flow bytes. (For each flow size, the corresponding point depicts the fraction of total bytes in flows less than or equal to that flow size.) As shown in Figure 3, the majority of bytes are in larger flows—in spite of the large number of small flows. This is due to the fact that while the simple requests and operation results dominate in terms of flow numbers, most bytes on the network are generated by actual value storage or retrieval.

In addition, we also categorized the sampled traffic as shown in Figure 4. This figure shows the fraction of total traffic (measured in bytes) that falls into the given traffic categories. This figure clearly shows that key-value store traffic dominates, followed by distributed file system traffic. Other traffic types are not a significant fraction of the total traffic.

Figure 4: Flow Type Categorization (fraction of total bytes in Key-value Store, Distributed File System, and Other traffic)

In summary, key-value store traffic dominates the traffic in the measured application. Most flows generated by this application are very small—too small for traditional congestion control to prevent problems such as incast. The majority of traffic, however, is contained in larger flows. Thus, congestion control plays an important role in preventing larger flows from experiencing congestion, and in preserving buffer space for small flows.

3 TCP in the Datacenter

Communication intensive datacenter applications present datacenter networks with several performance problems [5]. This is largely due to the fact that TCP—the foundation of many datacenter applications—was not originally designed with the characteristics of modern datacenters in mind. In this section we discuss three significant problems with TCP that we have encountered: delayed ACK induced stalls, incast, and problems with receive buffer tuning.

3.1 Stalls Due to Delayed ACKs

Delayed ACKs allow TCP to substantially reduce the number of acknowledgement packets sent. Delayed ACKs work by delaying the sending of an ACK for multiple segments. The delayed ACK effectively merges ACKs by cumulatively acknowledging multiple received segments.

Delayed ACKs have an associated timeout to prevent the sender from stalling forever due to a lack of ACKs from the receiver. The default timeout is tens to hundreds of milliseconds. In a datacenter with sub-millisecond RTT, the default delayed ACK timeout is far too large, and we have observed application-level timeouts that were caused by delayed ACKs. Section 4 will discuss resolving this issue.


3.2 Incast

The most vexing problem that TCP encounters in our datacenter network is “TCP incast” [12]. TCP incast occurs whenever a single receiver receives data from multiple senders in a short amount of time. This is a frequent communication pattern in datacenter applications. As depicted in Figure 5, when this situation occurs, the switch to which the receiver is attached is often overloaded: the senders send more data than the receiver can receive; the switch cannot store all of the data; and so the switch discards data that it does not have room for [13]. Unlike delayed ACK-induced timeouts, incast is much more difficult to remedy, and we will spend much of this paper discussing this problem.

Figure 5: Incast (n senders, each on a link of bandwidth L, send to a receiver attached by a single link of bandwidth L; the switch discards the data it cannot buffer)

Previous work discussing incast and other datacenter TCP problems has focused on Internet service applications and shown that TCP performs poorly in datacenters that are servicing these applications. While the nature and structure of our datacenter applications are very different, we still experience similar problems with TCP in our datacenter.

Consider our typical application structure discussed previously and depicted in Figure 1. Each transformation may read several data elements from our distributed storage systems, and may store several data elements into our distributed storage systems. As a result, reads from our distributed storage systems often result in a high degree of TCP incast. Writes to the distributed storage systems also contribute to incast as many writers may be writing to the same storage node.

At a high level, we find that incast produces the following problems at the application layer:

• Communication timeouts and retransmissions

• Lost throughput

• Increased latency

• Latency variance (jitter)

These problems can afflict even “innocent” applications and servers uninvolved in the communication. At the business level, further problems result:

• Application failures

• Idle servers waiting for communication, and increased costs associated with procuring and operating additional servers

• Application failures even for “innocent bystanders”

• Development effort to work around communication problems

• Effort lost troubleshooting network problems in innocent applications

• Effort lost coordinating among different development groups to avoid communication problems.

3.3 Receive Buffer Tuning

In addition, a very significant problem that we have encountered with TCP in the datacenter is receive buffer tuning [16]. The receive buffer size has a dramatic impact on TCP performance and server RAM utilization.

Figure 6: TCP convergence (throughput in Gbps versus time in seconds for Flow 1 and Flow 2)

Figure 7: DCTCP convergence (throughput in Gbps versus time in seconds for Flow 1 and Flow 2)


To illustrate this, consider the results of a simple two-flow throughput experiment. Both flows were sent from distinct servers to a common receiver. The first flow ran for 20 seconds. The second flow started 5 seconds later, and ran for a total of 10 seconds. The results are shown in Figure 6.

TCP convergence, fairness, and stability in this test are all extremely poor. TCP should be able to converge within a few RTTs, not several seconds. (While [11] discusses some detailed problems with TCP-CUBIC convergence, the behavior shown in Figure 6 is far worse than expected.)

Figure 7 repeats this test for DCTCP. Surprisingly, while [2] finds that DCTCP converges more slowly than TCP, Figure 7 shows DCTCP dramatically outperforming TCP with respect to stability, convergence, and short-term fairness.

The source of this unexpected behavior is receive buffer tuning. This will be addressed in detail in Section 9.

3.4 Summary

The problems discussed above are significant, and historically we worked around them at the application layer. In the following sections, we discuss how we have largely eliminated these problems, dramatically increased our network performance, and removed the need for application-level workarounds.

4 Delayed ACKs

As discussed earlier, delayed acknowledgements can cause significant problems. Delayed ACK timeouts are—by default—far too large for a datacenter setting. Fortunately, there are two simple alternatives to remedy this problem: 1) eliminate delayed acknowledgements, or 2) reduce the delayed acknowledgement timeout. We have investigated both approaches.

If ACKs could be generated without cost, the ideal ACK delay would be zero, and an ACK would be generated for every single packet. Unfortunately, while eliminating delayed ACKs eliminates the possibility of any sender stall, it does so at the cost of generating a significant number of packets. We do not find this increased load to be a problem in our network, but we do find it to be problematic on our end servers.

Figure 8 illustrates this behavior. In this test, one or two senders send to a single receiver. ACKs are delayed by a maximum of 0 (i.e. no delayed ACKs), 1, or 40 milliseconds. The total CPU % utilized by IRQ daemons on the receiver is plotted for each test (100% is the equivalent of 1 CPU completely busy). This test exhibits essentially no difference between delays of 1 and 40 milliseconds. Turning delayed acknowledgements entirely off, however, produces a sharp increase in CPU utilization for both one and two flows. (Repeating this test yields similar results with insignificant variation.)

Figure 8: Delayed ACK CPU Utilization (IRQ CPU utilization versus maximum ACK delay in ms, for 1 flow and 2 flows)

For this reason, in our network we now lower the delayed ACK timeout as much as possible without turning delayed ACKs off. Those constraints yield a delayed ACK setting of 1 ms. We will still incur an occasional stall, but the stall is not long enough to cause significant issues at our application layer. (Some applications, however, may benefit from turning off delayed ACKs entirely.)
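The 1 ms timeout itself is a kernel-level setting and is not shown here. For applications that instead want to suppress delayed ACKs on individual latency-critical connections, Linux exposes the TCP_QUICKACK socket option; the sketch below is a minimal illustration of that option rather than of the timeout reduction described above. Note that the kernel may clear quickack mode on its own, so applications typically re-assert it around latency-sensitive reads.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Request immediate (non-delayed) ACKs on a connected TCP socket.
     * Returns 0 on success, -1 on error (errno set by setsockopt). */
    int enable_quickack(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    }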

5 Reducing RTOmin

During the most communication-intensive phases of our application, we found that our applications were experiencing large numbers of incast-induced TCP timeouts. At the application layer, this resulted in a long tail on our task completion times. The effects of incast are clearly seen in Figure 9, which shows a TCP sequence graph from a single flow of a production application during a heavy all-to-all incast. The duplicate sequence numbers visible are packet losses and retransmissions that were successfully handled by TCP. The 200 ms pauses in the flow, however, are due to whole-window-loss-induced TCP timeouts incurring the RTOmin penalty.

Previous work [13] has proposed a simple technique to mitigate the effects of incast-induced TCP timeouts: reduce RTOmin. We employed this technique in our datacenter, and the benefits can be seen in Figure 10, which shows a TCP sequence plot of a flow experiencing incast. As with Figure 9, loss is visible, as is a timeout, but timeouts are reduced to 5 ms, which is the minimum effective RTOmin that our servers support.
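To put these magnitudes in context (the RTT figure here is an illustrative assumption, not a measurement from this work): with an uncongested RTT on the order of 200 µs, a 200 ms RTO idles a flow for roughly a thousand RTTs, while a 5 ms RTO idles it for only a few dozen, which helps explain why the shorter stalls in Figure 10 are far less disruptive at the application layer.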

As shown in Figures 9 and 10, reducing RTOmin significantly improved the performance of TCP in our datacenter by mitigating the effect of TCP timeouts.


Figure 9: RTOmin 200 ms (TCP sequence number versus time in ms; two RTO stalls are visible)

Figure 10: RTOmin 5 ms (TCP sequence number versus time in ms; one RTO stall is visible)

It did not, however, prevent timeouts. In fact, the rate of packet loss in our network increased significantly after we applied the RTOmin change. This is expected, as lowering RTOmin does not prevent packet loss and timeouts; it just mitigates their effects. Moreover, lower timeout values will increase the number of contending flows, which will tend to increase the overall number of lost packets.

In short, we found that reducing RTOmin greatly reduced the impact of incast on our applications. Network and server stability were not impacted by this change even when running on a cluster of over 2,000 servers. Innocent applications (applications not involved in the incast) were, however, still impacted. Moreover, network performance was still far from ideal. We were still incurring (much smaller) timeouts and the usual TCP latency. In the next section, we discuss addressing the root cause of incast.

6 DCTCP

Subsequent work on datacenter TCP has proposed several techniques to actually reduce packet losses due to incast, rather than just mitigate the effects of lost packets. Of these techniques, one of the most promising for deployment in our datacenter is DCTCP. DCTCP possesses several features that make it a particularly promising approach for us: it relies on capabilities that are available in current hardware and software, an implementation is available [8], and it does not contain features that we cannot use. (In particular, we decided against leveraging work that relies on flow priority or deadlines as our connections are long-lived and utilized for many different types of communication. As a result, communicating priority or deadline information to the network layer would be difficult or impossible for our applications.)

Our primary objectives for moving to DCTCP were to eliminate TCP timeouts (or nearly eliminate them), reduce latency, and reduce the network-induced coupling of applications. In particular, we wanted to protect “innocent bystanders” from aggressive applications.

In the following sections, we first discuss obstacles to reaping these benefits, and how we extended DCTCP to overcome these obstacles, followed by some discussion of our extended DCTCP’s performance.

7 DCTCP Deployment Challenges

7.1 Coexistence with TCP

In motivating the design of DCTCP, [1] states “[a datacenter] network is largely homogeneous and under a single administrative control. Thus backward compatibility, incremental deployment and fairness to legacy protocols are not major concerns.” For actual usage in our datacenter, however, these are all major concerns. We do not have the luxury of a “big bang” deployment for several reasons.

• There are multiple applications running in our datacenter with distinct ownership. It is critical that one application moving to DCTCP does not negatively impact any application using conventional TCP. Recall that one of our major arguments for deploying DCTCP is to reduce the coupling of applications.

• Many critical services cannot be moved to DCTCP. Even for applications with owners willing to make the move to DCTCP, there are services used by those applications that we simply cannot move to DCTCP. For instance, many of our applications leverage file servers that do not support DCTCP.

Unfortunately, DCTCP and TCP do not naturally coexist well. To demonstrate this, we conducted a simple test where one TCP flow and one DCTCP flow both send at maximum rate from distinct servers to a single receiver. (Again, these experiments are conducted on a 10 Gbps network using iperf as sender and receiver.) The TCP flow lasts for a total of 20 seconds. The DCTCP flow lasts for 10 seconds and starts 5 seconds after the TCP flow starts.

The results are shown in Figure 11. As soon as the DCTCP flow starts, the TCP flow almost completely stops while the DCTCP flow completely saturates the link. This is an extremely negative result, and essentially the complete opposite of what we require. (Note that it is possible to delay the onset of this behavior through configuration settings, but this will not solve the fundamental problem.)

Figure 11: DCTCP Coexistence with TCP (throughput in Gbps versus time in seconds for one CUBIC flow and one DCTCP flow)

Figure 12: Switch RED ECN Implementation (below the marking threshold, both ECT and non-ECT packets are queued; above it, ECT packets are marked CE while packets without ECT are discarded)

The reason that DCTCP traffic dominates conventional TCP traffic is the RED/ECN AQM behavior of a switch configured for DCTCP, which is as follows. As depicted in Figure 12, when the switch queue length is below the marking threshold (there is only one threshold for DCTCP), any packet that arrives is simply queued irrespective of ECT status. When the queue length is over the marking threshold, however, all ECT packets are marked with CE, but non-ECT packets are dropped. In DCTCP, the marking threshold is set to a very low value to reduce queueing delay, thus a relatively small amount of congestion will exceed the marking threshold. During such periods of congestion, conventional TCP will suffer packet losses and quickly scale back cwnd. DCTCP, on the other hand, will use the fraction of marked packets to scale back cwnd. Only when all packets are marked will cwnd be scaled back as far as conventional TCP. Thus rate reduction in DCTCP will be much lower than that of conventional TCP, and DCTCP traffic will dominate conventional TCP traffic traversing the same link.
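The asymmetry is easy to see numerically. The sketch below is an illustration we constructed, not code from [1] or [8]; the 20% marking fraction and the 100-segment window are arbitrary assumptions. It applies DCTCP’s update α ← (1 − g)α + gF followed by cwnd ← cwnd × (1 − α/2), and compares the result with a conventional loss response.

    #include <stdio.h>

    int main(void)
    {
        double g = 1.0 / 16.0;       /* DCTCP's EWMA gain */
        double frac_marked = 0.2;    /* assume 20% of packets were CE-marked */
        double alpha = frac_marked;  /* EWMA already converged under steady 20% marking */
        double cwnd_dctcp = 100.0;   /* segments */
        double cwnd_tcp = 100.0;

        /* DCTCP: alpha <- (1-g)*alpha + g*F, then cwnd <- cwnd*(1 - alpha/2). */
        alpha = (1.0 - g) * alpha + g * frac_marked;
        cwnd_dctcp *= (1.0 - alpha / 2.0);

        /* Conventional TCP: a loss in the same window triggers a fixed
         * multiplicative decrease (0.5 for Reno; CUBIC uses roughly 0.7). */
        cwnd_tcp *= 0.5;

        printf("DCTCP cwnd: %.1f segments, TCP cwnd: %.1f segments\n",
               cwnd_dctcp, cwnd_tcp);
        return 0;
    }

With one packet in five marked, DCTCP gives up 10% of its window per update, while a conventional flow that loses a packet at the same queue gives up half (Reno) or roughly 30% (CUBIC) of its window; only when every packet is marked does DCTCP’s reduction reach one half.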

As both TCP and DCTCP must service the same servers in our network, we resort to utilizing IP DSCP bits to segregate DCTCP traffic from conventional TCP traffic. AQM is applied to DCTCP traffic, while TCP traffic is managed via drop-tail queueing.
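A per-socket way to place traffic in its own class is to set the DSCP field on the socket; the sketch below illustrates the idea, and the CS1 codepoint used here is an arbitrary assumption rather than the value used in our network. The switch is then configured to apply RED/ECN marking only to the DCTCP class and drop-tail queueing to everything else.

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Tag a socket's packets with DSCP CS1 (DSCP 8, TOS byte 0x20) so the
     * network can apply ECN/AQM policy to this traffic class only.  For a
     * TCP socket, the two ECN bits of the TOS byte remain under kernel
     * control. */
    int tag_dctcp_class(int fd)
    {
        int tos = 0x20;
        return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
    }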

7.2 Non-compliant switches

While we are fortunate enough to have support for ECN marking on our top-of-rack switches, this is the only location in our network that supports ECN marking. Higher-level switches are purely drop-tail. DCTCP must gracefully support transit over non-ECN switches without impacting either the behavior of DCTCP traffic or conventional traffic. Our tests show that DCTCP successfully resorts to loss-based congestion control when transiting a congested drop-tail link.

7.3 Non-technical challenges

Even without any technical challenges, altering the network in a major enterprise is a difficult undertaking. Network administrators are, necessarily, risk-averse. A reliable network is a business-critical requirement. Thus, network innovations are often viewed as presenting significantly more risk than reward.

We were able to present a compelling case for DCTCP implementation due to the following:

• Reduction in coupling. Application coupling was a known phenomenon in our datacenter. Conventional TCP’s strong coupling of unrelated applications causes problems as discussed previously. DCTCP’s promise to greatly reduce the coupling between applications meant that our network administrators would directly benefit from reduced troubleshooting requests from applications experiencing mysterious network performance issues caused by unrelated applications.

• Timing. We timed our DCTCP roll-out to coincide with the deployment of new network switches in our environment. We worked with our network administrators to ensure that the switch features necessary to support DCTCP were available from day one.

• Primum non nocere. Our support for conventional TCP and non-ECN compliant switches enabled us to guarantee that we would not harm existing applications.


7.4 Connection Establishment

Segregating DCTCP from conventional TCP removed one potential showstopper from our DCTCP deployment effort. Nevertheless, we encountered one other major problem in DCTCP that had the potential to prevent DCTCP adoption in our network: we found that under load, DCTCP would fail to establish network connections due to a lack of ECT in SYN and SYN-ACK packets.

[1] does not discuss setting ECT on SYN and SYN-ACK packets. The Stanford implementation [8] does not set ECT on either SYN or SYN-ACK packets. This is in line with RFC 3168 [14], which states “A host MUST NOT set ECT on SYN or SYN-ACK packets.” RFC 5562 [10] (derived from ECN+ [9]) proposes setting ECT on SYN-ACK packets, but maintains the restriction of no ECT on SYN packets.

RFC 3168 and RFC 5562 prohibit ECT in SYN packets due to security concerns regarding malicious SYN packets with ECT set. These RFCs, however, are intended for general Internet use, and do not directly apply to DCTCP. In our internal network, we do not tolerate the compromised servers necessary for an attacker to send such packets. Moreover, the Stanford implementation’s adoption of these RFCs likely owes more to its leveraging of the existing ECN support in Linux than anything else.

We find that setting ECT on SYN and SYN-ACK is critical for the practical deployment of DCTCP. Without this feature, SYN and SYN-ACK packets will be dropped whenever there is even minor congestion. As discussed in Section 7.1, and depicted in Figure 12, whenever the queue length is greater than the marking threshold, non-ECT packets are dropped. Thus, if SYN and SYN-ACK packets are non-ECT they will be dropped with high probability. We modified DCTCP to apply ECT to both SYN and SYN-ACK packets. We refer to this implementation as “DCTCP+” to distinguish it from the original DCTCP implementation. (Following the naming convention of ECN+, which extended ECN with ECT on SYN-ACK only.)

To measure the effect of this issue, we conducted an experiment where we disabled ECT for SYN packets and attempted to establish a DCTCP connection (with no SYN or SYN-ACK ECT) in the presence of a number of competing DCTCP+ flows which were already established and sending data at maximum rate. As shown in Figure 13, as the number of competing flows increases, it quickly becomes hard, then impossible, to establish a connection when SYN packets are non-ECT. Thus, we utilize DCTCP+ in our deployment, which marks both SYN and SYN-ACK packets as ECT.

Figure 13: Connection Probability without SYN ECT (connection establishment probability versus number of competing flows, 0 to 16)

Note that given our support for conventional TCP, we could use DSCP to cause SYN and SYN-ACK packets only to be treated as conventional TCP. We do not take this approach, as it would split packets from a single flow across two separate paths in our network, which is highly undesirable.

8 DCTCP+ Performance

We now discuss several elements of DCTCP+ performance illustrating where DCTCP+ does well, and where there is room for improvement.

8.1 Incast Throughput and Fairness with Buffer Tuning Active

We first measured performance in an incast scenario similar to that in Figure 5. In this case, a single receiver received traffic from 19 senders for a total of 10 seconds (as discussed previously, all experiments in this section are conducted on a 10 Gbps network). Importantly, automatic receive buffer tuning is on for this test; we will later show that this has a dramatic effect on TCP performance but very little for DCTCP+. Figures 14 and 15 show summarized throughput statistics for all 19 flows for each experiment. DCTCP+ fairly distributes the link bandwidth among flows, resulting in a very narrow throughput distribution while fully utilizing the link. TCP is also able to fully utilize the link, but does so very inefficiently as flows stall due to a combination of packet loss and incorrectly sized receive buffers. The link is able to remain utilized, however, as other flows step in and utilize the missing bandwidth. Nevertheless, the median throughput is lower, and there is a large variation among flow throughput. In short, under DCTCP+, flow performance is fast and reliable, while under TCP, packet loss and the poor performance of buffer auto tuning cause extremely variable throughput.


Figure 14: DCTCP single-receiver incast (per-flow throughput in Mbps; mean, median, max, and min over 5 tests)

Figure 15: TCP single-receiver incast, buffer tuning active (per-flow throughput in Mbps; mean, median, max, and min over 5 tests)

          TCP     DCTCP+
Mean      4.01    0.0422
Median    4.06    0.0395
Maximum   4.20    0.0850
Minimum   3.32    0.0280
σ         0.167   0.0106

Table 1: Per-packet latency in ms

Moreover, as shown in Table 1, per-packet latency under TCP is two orders of magnitude greater than per-packet latency under DCTCP+. DCTCP+’s reliably low latency enables higher-layer applications to reliably communicate in a very short time span. Under TCP (particularly before we lowered RTOmin), our applications needed added logic to deal with the unpredictable latency and throughput that incast induced. DCTCP+’s consistently superb performance makes it a superior transport protocol to TCP within our datacenter.

8.2 Scale

The scalability afforded by datacenter computing lies at the heart of applications ranging from web search engines, to the Monte Carlo simulations and data analytics running in our datacenter. Realizing the benefits of scale, however, is challenging for many components of networked systems. For DCTCP+ to be an effective datacenter transport mechanism, it must scale with the applications that it supports.

In this section, we examine the scalability of DCTCP+ experimentally and analytically.

Incast traffic patterns are particularly difficult to scale. We examined DCTCP+ support at scale for incast by sending large numbers of long-lived flows (20 seconds) from many senders to a single receiver. Each flow was generated by a distinct server using iperf. Ideally, we should see that—as with TCP—the link would be fully utilized and each flow would receive a fair share of the link, but—unlike TCP—latency would remain low.

The results of this test are shown in Tables 2 and 3. Table 2 shows that throughput and long-term fairness are excellent through 500 servers. Table 3, however, exhibits some problems. The first problem to notice is that latency is relatively high even for 100 servers. At 300 servers, the high latency shows that the receive queue is entirely full, and at 400 servers significant amounts of traffic are lost and numerous timeouts are occurring. By 500 servers, 8.7% of packets sent are retransmissions, and timeouts are very significant; as a result, short-term flow fairness will be poor. In a nutshell, it seems that at this scale, DCTCP+ is performing no better than TCP.

Senders   Total   Mean   Max    Min    σ
100       9,901   99.0   99.3   88.6   1.06
200       9,900   49.7   49.9   46.1   0.35
300       9,901   33.2   34     31.2   0.36
400       9,894   24.9   28.5   20.2   1.01
500       9,895   20.0   23.9   13.8   1.42

Table 2: Scale Test: Throughput (Mbps)

                      Retransmissions
Senders   RTT (ms)   Total   %     RTO
100       1.60       0       0     0
200       3.11       0       0     0
300       4.38       3       0     0
400       4.42       702     4.6   274
500       4.44       1110    8.7   655

Table 3: Scale Test: Latency and Retransmissions

Why is latency so high? Shouldn’t the switch be marking packets, causing DCTCP+ to back off before latency gets so high? Packet traces from a sender involved in this test show that for all cases, the switch is marking 100% of packets in steady state, yet DCTCP+ is still sending packets. In other words, even when the switch is telling DCTCP+ to fall back aggressively, DCTCP+ refuses to fall back enough to prevent congestion.

The source of this behavior is in the cwnd update procedure of DCTCP+. According to [1], DCTCP+ updates cwnd as:

cwnd ← cwnd × (1 − α/2)

Actual TCP implementations, however, are more intricate, and the Linux implementation in [8] updates cwnd as follows:

    /* dctcp_alpha is alpha in fixed point (scaled to 1024 in this
     * implementation), so the >> 11 computes cwnd * alpha / 2; cwnd
     * never drops below two segments. */
    cwnd_new = max(tp->snd_cwnd -
                   ((tp->snd_cwnd * tp->dctcp_alpha) >> 11),
                   2U);

In other words, irrespective of measured congestion, DCTCP+ will always be willing to send two segments. This effectively puts a lower limit on the DCTCP+ transmission rate per sender of:

TransmissionRate ≥ (2 × SegmentSize) / RTT

The resulting load for our scale test is shown in Table 4.

Senders   Load (Gbps)
100       3.27
200       6.55
300       9.82
400       13.09
500       16.36

Table 4: DCTCP+ Load vs. Scale
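As a rough consistency check (taking the 9K MTU from Section 2 and the ≈4.4 ms congested RTT from Table 3 as inputs), two 9,000-byte segments per RTT is 144,000 bits every 4.4 ms, or about 33 Mbps per sender; 100 senders therefore present roughly 3.3 Gbps of load and 300 senders roughly 9.8 Gbps, in line with Table 4.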

By 300 servers, load is nearly at the capacity of the link, and at higher scales, the load exceeds the link capacity. The result is the significant packet drops, retransmissions, and timeouts shown above. In effect, once the load due to the DCTCP+ minimum transmission rate exceeds the link capacity, DCTCP+ congestion control is no longer in effect, and TCP congestion control takes over. Hence, at scales higher than 300 in this test, DCTCP+ congestion control is no longer in effect.

DCTCP+ scale can be extended by reducing the minimum transmission rate per server. This can be done by applying the cwnd cap logic found elsewhere in the Linux TCP implementation:

    /* Allow cwnd to fall to one segment, and never let it exceed the
     * data currently in flight by more than one segment. */
    cwnd_new = max(tp->snd_cwnd -
                   ((tp->snd_cwnd * tp->dctcp_alpha) >> 11),
                   1U);
    cwnd_new = min(cwnd_new,
                   tcp_packets_in_flight(tp) + 1U);

With this addition, under a congested network, only one packet will be allowed per RTT and the scaling will double: just over 600 servers can send at full rate to a single receiver without the minimum DCTCP+ transmission rate exceeding the link capacity.

While this change may result in additional delayed acknowledgements, our initial evaluation indicates that lowering the delayed acknowledgement timeout as discussed in Section 4 mitigates this concern. We leave a full evaluation for future work.

8.3 Operational Experience

We have been running DCTCP+ at a scale of approximately 600 servers for nearly one and a half years as of this writing. While quantifying the isolated benefits of DCTCP+ is ongoing work, qualitatively, we have found DCTCP+ to be a stable transport protocol, and with the RTOmin reduction, delayed ACK reduction, and DCTCP+ all in place, we no longer observe any application-layer issues that are caused by TCP. This is a significant improvement.

9 Receive Buffer Tuning

Figures 6 and 7 previously showed an unexpected result: TCP converging more slowly than DCTCP, and generally performing very poorly. Careful analysis of Figures 17a&b in [1] shows that the creators of DCTCP observed similar behavior experimentally (though this particular behavior was not discussed in [1]): DCTCP outperforms TCP in their experiment with respect to stability, convergence, and short-term fairness. We repeat this convergence experiment in our network under several scenarios—first on a 1 Gbps network, then on a 10 Gbps network. The 1 Gbps result for TCP is shown in Figure 16. This closely matches the results from [1].

Figure 16: 1G, TCP, Buffer Tuning On (throughput in Gbps versus time in seconds for Flows 1–5)

Figure 17 shows that moving to a 10 Gbps network exacerbates the problems with convergence, fairness, and stability. We find that, as with the 1 Gbps result presented in [1], at 10 Gbps DCTCP+ convergence is superior to TCP convergence, as shown in Figure 18.


Figure 17: TCP, Buffer Tuning On (throughput in Gbps versus time in seconds for Flows 1–5)

Figure 18: DCTCP+, Buffer Tuning On (throughput in Gbps versus time in seconds for Flows 1–5)

These results seemingly defy [2], which showed DCTCP converging more slowly than TCP. The cause of this problem has a simple explanation: receive buffer tuning. Historically, network developers were tasked with setting TCP buffer sizes manually. Getting the buffer sizes right is important for both network and end-system performance: undersized buffers hurt network throughput; overly generous buffer sizes consume RAM, impact system performance, and limit application scale. It is possible to manually set buffer sizes to attempt to strike a balance, but this is very undesirable as it binds the performance of an application to the behavior of a particular network. Moreover, it fails to allow dynamic memory management to take into account a server’s memory state.

To overcome these limitations, several approaches have been developed to dynamically set TCP buffer sizes. Unfortunately, in a datacenter setting, these algorithms can perform poorly. In principle, the receive buffer of an application should be set to the bandwidth delay product (BDP) of a link. The trouble is that inside of a datacenter, propagation delay is extremely small—approximately four orders of magnitude less than the queueing delay of a congested link! As a result, the bandwidth delay product of a link varies significantly, and—worse—is a function of the receive buffer size. This strong feedback between the receive buffer size and the delay it induces makes tuning a difficult problem. The tuning algorithm takes many seconds to adapt from the low-latency congestion-free regime to the high-latency congested regime. As a result, TCP performance in our datacenter is very poor when automatic receive buffer tuning is enabled. This also explains why DCTCP+ is able to outperform TCP: DCTCP+ keeps latency far lower than TCP. As a result, the tuning algorithm experiences far less feedback and has a much easier time finding the correct buffer size.
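To make the scale of this problem concrete (the delay figures are illustrative assumptions, not measurements from this work): at 10 Gbps, a 50 µs uncongested RTT corresponds to a BDP of roughly 62 KB, while a congested RTT in the 4–5 ms range (compare Table 3) corresponds to roughly 5–6 MB. The tuning algorithm must traverse a gap of about two orders of magnitude, and the buffer size it chooses itself influences which end of that range the connection operates at.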

Figure 19: TCP, Buffer Tuning Off (throughput in Gbps versus time in seconds for Flows 1–5)

Figure 20: DCTCP+, Buffer Tuning Off (throughput in Gbps versus time in seconds for Flows 1–5)

Turning off receive buffer tuning and manually setting the receive buffer size to be greater than the maximum possible bandwidth-delay product results in much better behavior for TCP, as shown in Figure 19. With this change, TCP stability, convergence, and fairness all exceed those of DCTCP+. DCTCP+ performance, on the other hand, is not changed significantly by manually setting the buffer size, as shown in Figure 20.
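On Linux, this can be done per socket: explicitly setting SO_RCVBUF before the connection is established disables receive-buffer autotuning for that socket (net.ipv4.tcp_moderate_rcvbuf controls it system-wide). The sketch below illustrates this under our assumptions; the 16 MB figure is an arbitrary stand-in for “greater than the maximum bandwidth-delay product possible” and must also fit within the net.core.rmem_max limit.

    #include <sys/socket.h>

    /* Pin the receive buffer to a fixed size, opting this socket out of the
     * kernel's automatic receive-buffer tuning.  The kernel clamps the
     * request to net.core.rmem_max. */
    int pin_receive_buffer(int fd)
    {
        int bytes = 16 * 1024 * 1024;
        return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    }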

In summary, receive buffer tuning can have a dramatic impact on TCP performance. The anomalous results shown in Figures 17a&b of [1], and discussed in this paper, are explained by poor tuning of the TCP receive buffer. With proper receive buffer sizing, TCP stability, convergence, and fairness outperform DCTCP+. Achieving proper receive buffer sizing, however, is much more difficult under TCP than DCTCP+ due to the massive dynamic range of latencies that even two competing flows can generate.

10 Related Work

TCP incast was first discussed by Nagle et al. [12]. Phanishayee et al. [13] explored solutions such as reducing RTOmin. Vasudevan et al. [17] proposed reducing RTOmin further using fine-grained timers. Instead of this, we simply reduced RTOmin as far as our kernel supported.

Yu et al. [20] analyze application performance in the datacenter network of an Internet service provider; they identify several performance problems caused by applications, the end-server network stacks, and the network itself. We independently have encountered similar problems in a completely different context, and we believe that the problems encountered in [20] are general problems likely to be found widely in datacenter communication. To fix the problems with delayed acknowledgements, [20] suggests either reducing the delayed ACK timeouts or disabling delayed ACKs. Our work goes further by analyzing the tradeoff between these two options.

Wu et al. [19] also observe that switches running RED/ECN drop non-ECT packets, but do not discuss the impact of this behavior on DCTCP.

Semke et al. [16] developed a method of automatically tuning TCP buffers that is the basis of the current Linux autotuning algorithm.

There has been a good deal of work—such as [7] [18] [21]—on achieving congestion control superior to that attainable by DCTCP by incorporating knowledge of flow priorities and deadlines into congestion control. Unfortunately, these techniques are not readily applicable in our environment.

pFabric [3] takes a clean-slate approach to datacenter communication. This is promising work, but outside of the scope of our work as we were restricted to techniques that we could run in production today.

11 Conclusion

TCP has been tremendously successful in the Internet, and is a ubiquitous protocol that is critical to countless applications. TCP support in datacenters promises to allow these applications to run alongside new applications. Unfortunately, experience has shown that TCP’s design assumptions break down inside modern datacenters, and performance is often inadequate.

In this paper, we have shown that leveraging recent work overcomes the major deficiencies of TCP inside of the datacenter. We have shown that DCTCP coexistence with TCP is critical in our environment, and demonstrated how this can be accomplished. Moreover, we have shown how a small extension to DCTCP—employing ECT in SYN and SYN-ACK packets—removes a potentially fatal problem with DCTCP.

Nevertheless, this work has also highlighted areas for future work. Receive buffer auto tuning has a dramatic impact on performance, yet current implementations can perform very poorly in the datacenter. In addition, we have shown how DCTCP scale can be improved; ideally DCTCP would scale even further before filling queues and reverting to TCP.

In closing, deploying recently developed improvements to TCP (along with our extensions) has dramatically improved TCP performance in our datacenter, without requiring any modifications to our applications or distributed storage systems.

Acknowledgements

We would like to thank our reviewers for their feedback. In addition, we greatly appreciate the assistance of Daniel Varga, Wai-Hong Lui, Vikas Chawla, Eric Hagberg, John MacIntyre, Peter King, Johnathan Edwards, Sam Shteingart, Michael Jansen, Jason Greenberg, Matthew Whitehead, Daniel Borkmann, Florian Westphal, and Frank Hirtz.

References

[1] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data center TCP (DCTCP). In Proceedings of SIGCOMM 2010, 2010.

[2] M. Alizadeh, A. Javanmard, and B. Prabhakar. Analysis of DCTCP: Stability, convergence, and fairness. In Proceedings of SIGMETRICS 2011, 2011.

[3] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker. pFabric: Minimal near-optimal datacenter transport. In Proceedings of SIGCOMM 2013, 2013.

[4] L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 2009.

[5] P. Gill, N. Jain, and N. Nagappan. Understanding network failures in data centers: Measurement, analysis, and implications. In Proceedings of SIGCOMM 2011, 2011.

[6] S. Ha, I. Rhee, and L. Xu. CUBIC: A new TCP-friendly high-speed TCP variant. ACM SIGOPS Operating Systems Review, 2008.

[7] C.-Y. Hong, M. Caesar, and P. Godfrey. Finishing flows quickly with preemptive scheduling. In Proceedings of SIGCOMM 2012, 2012.

[8] A. Kabbani, M. Yasuda, and M. Alizadeh. DCTCP-Linux. https://github.com/myasuda/DCTCP-Linux, 2012.

[9] A. Kuzmanovic. The power of explicit congestion notification. In Proceedings of SIGCOMM 2005, 2005.

[10] A. Kuzmanovic, A. Mondal, S. Floyd, and K. Ramakrishnan. Adding Explicit Congestion Notification (ECN) capability to TCP’s SYN/ACK packets. RFC 5562, 2009.

[11] D. Leith, R. Shorten, and G. McCullagh. Experimental evaluation of CUBIC-TCP. In Proceedings of PFLDnet 2008, 2008.

[12] D. Nagle, D. Serenyi, and A. Matthews. The Panasas ActiveScale storage cluster: Delivering scalable high bandwidth storage. In Proceedings of Supercomputing 2004, 2004.

[13] A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Measurement and analysis of TCP throughput collapse in cluster-based storage systems. In Proceedings of FAST 2008, 2008.

[14] K. Ramakrishnan, S. Floyd, and D. Black. The addition of Explicit Congestion Notification (ECN) to IP. RFC 3168, 2001.

[15] J. Rothschild. High performance at massive scale: Lessons learned at Facebook. mms://video-jsoe.ucsd.edu/calit2/JeffRothschildFacebook.wmv.

[16] J. Semke, J. Mahdavi, and M. Mathis. Automatic TCP buffer tuning. In Proceedings of SIGCOMM 1998, 1998.

[17] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. Andersen, G. Ganger, G. Gibson, and B. Mueller. Safe and effective fine-grained TCP retransmissions for datacenter communication. In Proceedings of SIGCOMM 2010, 2010.

[18] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron. Better never than late: Meeting deadlines in datacenter networks. In ACM SIGCOMM Computer Communication Review, 2011.

[19] H. Wu, J. Ju, G. Lu, C. Guo, Y. Xiong, and Y. Zhang. Tuning ECN for data center networks. In Proceedings of CoNEXT 2012, 2012.

[20] M. Yu, A. Greenberg, D. Maltz, J. Rexford, L. Yuan, S. Kandula, and C. Kim. Profiling network performance for multi-tier data center applications. In Proceedings of NSDI 2011, 2011.

[21] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz. DeTail: Reducing the flow completion time tail in datacenter networks. In Proceedings of SIGCOMM 2012, 2012.
