
INTERNATIONAL JOURNAL OF COMMUNICATION SYSTEMS
Int. J. Commun. Syst. (2012)
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/dac.2402

    A survey on TCP Incast in data center networks

Yongmao Ren 1,*, Yu Zhao 1,2, Pei Liu 1,2, Ke Dou 1,2 and Jun Li 1

1 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2 Graduate University of Chinese Academy of Sciences, Beijing 100049, China

    SUMMARY

In high-bandwidth, low-latency data center networks, when multiple data senders simultaneously communicate with a single receiver, namely in a many-to-one communication pattern, the burst of data overloads the receiver's switch buffers, which leads to Transmission Control Protocol (TCP) throughput collapse. This is the so-called TCP Incast problem, and it has become a hot research topic in recent years. Many proposals have been put forward at multiple layers, including the link layer, transport layer, and application layer, to mitigate TCP Incast. In this paper, an in-depth survey of these proposals is given, and the principles, merits, and drawbacks of the main proposals for solving the TCP Incast issue are analyzed and summarized. Copyright © 2012 John Wiley & Sons, Ltd.

    Received 1 December 2011; Revised 9 May 2012; Accepted 13 June 2012

    KEY WORDS: TCP Incast; data center networks; congestion control; cloud computing

    1. INTRODUCTION

Cloud computing services and applications need big data centers for support. Companies like Google, Microsoft, Amazon, and IBM use data centers for Web search, storage, e-commerce, and large-scale general computations. The main characteristics of a data center network are high-speed links, low propagation delays, and limited-size switch buffers. Transmission Control Protocol (TCP) is currently the most popular transport protocol in the Internet, and it is also widely used in data center networks. However, the unique workloads, scale, and environments of data center networks violate the WAN assumptions on which TCP was originally designed. For example, in contemporary operating systems such as Linux, the default retransmission timeout (RTO) timer value is set to 200 ms, a reasonable value for a WAN, but two to three orders of magnitude greater than the average round-trip time in a data center network [1]. There are also performance bottlenecks in Linux TCP itself that have negative effects on data center networks [2].

One communication pattern, termed 'Incast' [3] by researchers, elicits a pathological response from popular implementations of TCP. Figure 1 shows a typical TCP Incast scenario used in much of the literature. In this pattern, a client connects to the data center via a switch, which in turn is connected to many servers. The client requests data from one or more servers, and the data are transferred from the servers to the client via the bottleneck link from the switch to the client in a many-to-one fashion. The client requests data using a large logical block size (e.g., 1 MB), while the actual data blocks are striped over many servers using a much smaller block size (e.g., 32 KB) called the server request unit (SRU). A client issuing a request for a data block sends a request packet to each server that stores data for the requested block. The request, which is served through TCP, is completed only after all SRUs of the requested data block have been successfully received by the client.
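To make the pattern concrete, the following sketch simulates such a barrier-synchronized read in plain Python. It is an illustration of the logical flow only; the block size, SRU size, and server count are the example values above, and the function names are hypothetical:

```python
import concurrent.futures

BLOCK_SIZE = 1 * 1024 * 1024            # logical block requested by the client (1 MB)
SRU_SIZE = 32 * 1024                    # server request unit (32 KB)
NUM_SERVERS = BLOCK_SIZE // SRU_SIZE    # the block is striped across 32 servers

def serve_sru(server_id: int) -> bytes:
    # Each server returns its 32 KB fragment; in a real cluster all of these
    # responses traverse the same bottleneck link at once.
    return bytes(SRU_SIZE)

def synchronized_read() -> bytes:
    # The client fans out one request per server and cannot complete the block
    # read until every SRU has arrived: one stalled (timed-out) flow delays
    # the whole request.
    with concurrent.futures.ThreadPoolExecutor(NUM_SERVERS) as pool:
        fragments = list(pool.map(serve_sru, range(NUM_SERVERS)))
    return b"".join(fragments)

if __name__ == "__main__":
    block = synchronized_read()
    print(f"received {len(block)} bytes from {NUM_SERVERS} servers")
```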

*Correspondence to: Yongmao Ren, Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China. E-mail: [email protected]


    Figure 1. Typical TCP Incast scenario.

As the number of concurrent senders increases, the data transfer workload can overflow the buffer at the bottleneck switch, leading to packet losses and subsequent TCP retransmissions. The result can be a significant degradation of application goodput, called TCP Incast throughput collapse.

The TCP Incast issue potentially arises in many typical data center applications: in cluster storage, when storage nodes respond to requests for data; in Web search, when many workers respond near-simultaneously to search queries; and in batch processing jobs like MapReduce [4], in which intermediate key-value pairs from many Mappers are transferred to the appropriate Reducers during the shuffle stage.

The TCP Incast issue has attracted many researchers' interest because of the development of cloud computing. Many possible solutions have been proposed at multiple layers, mainly the link layer, transport layer, and application layer. However, some of them are highly effective but costly, while others are cheap but only weakly effective. In this paper, the principles, merits, and drawbacks of the main proposals for solving the TCP Incast issue are analyzed and summarized.

    2. METHODS OF SOLVING THE TCP INCAST ISSUE

Generally speaking, there are two main methods of solving the TCP Incast issue. One is to avoid or reduce packet loss; the other is to recover quickly after packet loss occurs, reducing its effect.

    2.1. Reduce packet loss

A key reason for TCP Incast is the many-to-one transport pattern: many data senders transmit simultaneously, the bottleneck switch buffers are overloaded, and packets are lost. To reduce packet loss, several specific measures are possible, for example, increasing the switch buffer [3], globally scheduling data transfers [5], and explicitly informing data senders of the congestion state so that they can adjust their sending rates accordingly [6, 7].



    2.2. Quick recovery

Once packet loss occurs, the TCP sender waits for duplicate ACKs to arrive or for the RTO timer to expire, then reduces its congestion window and enters the congestion recovery process. Possible solutions include reducing the RTO value to let the TCP sender enter retransmission more quickly [1, 3] and designing quicker congestion recovery algorithms [8].

    3. EXISTING PROPOSALS AND ANALYSIS

So far, many possible solutions have been proposed. By the layer they address, they can be divided into link layer proposals, transport layer proposals, application layer proposals, and other proposals.

    3.1. Link layer proposals

At the link layer, there is much research on congestion control and flow control in wireless networks [9]. In data center networks, congestion control and flow control are the two main link-layer methods to mitigate the TCP Incast issue. The IEEE 802.1Qau Congestion Notification project [10] specifies a Layer 2 congestion control mechanism in which a congested switch can control the rate of the Layer 2 sources whose packets pass through it, similar to TCP/RED [11]. The IEEE 802.1Qbb Priority-based Flow Control project [12] introduces a link-level, per-priority flow control or PAUSE function.

3.1.1. Congestion control. The QCN (quantized congestion notification) algorithm [6] is a well-known algorithm developed for inclusion in the IEEE 802.1Qau standard to provide congestion control at Layer 2 in data center networks. It is composed of two parts.

The Congestion Point (CP) algorithm: the switch buffer samples incoming packets and sends the source a feedback message containing information about the extent of congestion at the CP. The CP buffer is shown in Figure 2. The congestion measure F_b is calculated as

    F_b = −(Q_off + w · Q_δ),    (1)

where Q_off = Q − Q_eq, Q_δ = Q − Q_old = Q_a − Q_d, Q denotes the instantaneous queue size, Q_eq is the desired operating point, Q_old is the queue size when the last feedback message was generated, Q_a and Q_d denote the number of arriving and departing packets between two consecutive sampling times, respectively, and w is a non-negative constant. F_b captures a combination of queue-size excess (Q_off) and rate excess (Q_δ).
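A minimal sketch of the CP side in Python. The constants Q_EQ and W and the sample values are illustrative placeholders, not values mandated by 802.1Qau:

```python
Q_EQ = 26   # desired queue operating point, in packets (illustrative)
W = 2.0     # weight on the rate-excess term (illustrative)

def congestion_measure(q_now: int, q_old: int) -> float:
    """Compute the QCN congestion measure F_b from two queue samples.

    q_now: instantaneous queue size at this sampling instant.
    q_old: queue size when the previous feedback message was generated.
    """
    q_off = q_now - Q_EQ       # queue-size excess
    q_delta = q_now - q_old    # rate excess (arrivals minus departures)
    return -(q_off + W * q_delta)

# A feedback message carrying |F_b| goes back to the source only when F_b is
# negative, i.e., the queue is above its target and/or growing.
fb = congestion_measure(q_now=40, q_old=30)
if fb < 0:
    print(f"send feedback |F_b| = {abs(fb)}")
```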

The Reaction Point (RP) algorithm: the rate limiter associated with a source decreases its sending rate based on feedback received from the CP, and increases its rate voluntarily to recover lost bandwidth and to probe for extra available bandwidth. Figure 3 shows the basic RP behavior.

    Figure 2. Congestion detection in QCN CP [6].


    Figure 3. QCN RP operation [6].

Rate decreases: when a feedback message is received, the current rate (CR) and target rate (TR) are updated as follows:

    TR ← CR    (2)
    CR ← CR · (1 − G_d · |F_b|),    (3)

where the constant G_d is chosen to ensure that the sending rate cannot decrease by more than 50%, and thus G_d · |F_bmax| = 1/2, where F_bmax denotes the maximum of F_b.

Rate increases: this occurs in two phases, Fast Recovery (FR) and Active Increase (AI).

Fast Recovery: the CR is updated as follows:

    CR ← (1/2) · (CR + TR)    (4)

Active Increase: in each Active Increase cycle, the rate limiter updates TR and CR as follows:

    TR ← TR + R_AI    (5)
    CR ← (1/2) · (CR + TR),    (6)

where R_AI is a constant, chosen to be 5 Mb/s in the baseline implementation.

The authors in [13] proposed modifications to the CP and RP algorithms. They found that if every packet is sampled, the performance is much better, and collapse does not occur even in simulations with a 32 KB buffer. However, sampling every packet might not be necessary: they proposed two strategies that offer almost the same level of performance as sampling every packet while requiring fewer packets to be sampled during periods without congestion. They also proposed reducing the amount by which an RP increases its rate during congestion by making the self-increase rate (R_AI) congestion aware: R_AI is set to 5 Mb/s when no congestion occurs, but to a fraction of this amount under negative feedback.

Approximately Fair QCN (AF-QCN) [14] modifies QCN to converge to fairness faster. QCN sends the same congestion feedback value to all flows, whereas AF-QCN distinguishes each flow based on its (estimated) sending rate and adjusts the feedback to each flow accordingly.
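A sketch of the baseline RP update rules (equations 2 through 6) in Python. G_d and R_AI follow the constraints stated above; the feedback values and cycle counts are hypothetical:

```python
GD = 0.5 / 64.0   # chosen so that GD * |F_b,max| = 1/2, assuming |F_b,max| = 64
R_AI = 5.0        # active-increase step, Mb/s (baseline value)

class ReactionPoint:
    def __init__(self, rate_mbps: float):
        self.cr = rate_mbps   # current rate
        self.tr = rate_mbps   # target rate

    def on_feedback(self, fb: float) -> None:
        # Rate decrease (eqs. 2-3): remember the pre-decrease rate as the
        # target, then cut CR by at most 50%.
        self.tr = self.cr
        self.cr *= (1.0 - GD * abs(fb))

    def fast_recovery(self) -> None:
        # Eq. 4: move halfway back toward the target rate.
        self.cr = 0.5 * (self.cr + self.tr)

    def active_increase(self) -> None:
        # Eqs. 5-6: probe for bandwidth beyond the old target.
        self.tr += R_AI
        self.cr = 0.5 * (self.cr + self.tr)

rp = ReactionPoint(rate_mbps=1000.0)
rp.on_feedback(fb=-32.0)    # strong congestion feedback cuts the rate
for _ in range(5):          # five FR cycles, then AI resumes probing
    rp.fast_recovery()
rp.active_increase()
print(f"CR = {rp.cr:.1f} Mb/s, TR = {rp.tr:.1f} Mb/s")
```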

Fair QCN [15] was proposed to improve the fairness of multiple flows sharing one bottleneck link. It feeds QCN congestion messages back to all flow sources whose sending rates exceed their share of the bottleneck link capacity. The congestion parameter for flow i is calculated as follows:

    F_b(i) = (A_i / Σ_{k=1..N'} A_k) · F_b,    (7)


where N' is the total number of overrate flows, and A_k is the total number of packets received from the k-th overrate source flow. Thus, F_b(i) is proportional to F_b.
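A small sketch of how the per-flow feedback of equation (7) could be apportioned, assuming the switch tracks per-flow packet counts (the flow names and counts below are hypothetical):

```python
def fair_qcn_feedback(fb: float, pkt_counts: dict[str, int]) -> dict[str, float]:
    """Split the aggregate congestion measure F_b among overrate flows
    in proportion to the packets each contributed (eq. 7)."""
    total = sum(pkt_counts.values())
    return {flow: (count / total) * fb for flow, count in pkt_counts.items()}

# Three overrate flows; the heaviest contributor receives the strongest feedback.
per_flow = fair_qcn_feedback(fb=-48.0, pkt_counts={"f1": 600, "f2": 300, "f3": 100})
print(per_flow)   # {'f1': -28.8, 'f2': -14.4, 'f3': -4.8}
```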

3.1.2. Flow control. Phanishayee et al. propose using Ethernet flow control (EFC) to solve the TCP Incast problem [3]. An overloaded switch that supports EFC can send a pause frame on the interface feeding the congested buffer, informing all devices connected to that interface to stop forwarding data for a designated time. During this period, the overloaded switch can reduce the pressure on its queues.

Simulation results show that EFC is effective in the configuration where all clients and servers connect to a single switch. However, it does not work well with multiple switches because of head-of-line blocking: when the pause frame from one congested buffer is intended to stop the flow causing the congestion, it also stops other flows because of the switch's FIFO queueing. Another problem with EFC is the inconsistency of its implementation across switch vendors.

To address these problems, a number of recent Ethernet initiatives [16] add congestion management with rate-limiting behavior and improve the PAUSE functionality with a more granular per-channel capability. These initiatives contribute to creating a lossless, flow-controlled version of Ethernet, referred to as Data Center Ethernet. IEEE 802.1Qbb priority flow control, one of the functions of IEEE 802.1 Data Center Bridging, extends the basic IEEE 802.3x PAUSE semantics to multiple classes of service (CoS), enabling applications that require flow control to coexist on the same wire with applications that perform better without it [17].

Ethernet with lossless behavior will mitigate the Incast problem effectively, but implementing the new standards in switches will take time and money.

    3.2. Transport layer

At the transport layer, the proposals can be divided into two types: modifying TCP parameters while keeping the TCP protocol unchanged, and designing enhanced TCP protocols.

    3.2.1. Modification on Transmission Control Protocol parameters.

3.2.1.1. Reducing the minimum retransmission timeout timer. The default value of the TCP minimum RTO timer is 200 ms, which was originally designed for WAN environments. Unfortunately, this value is orders of magnitude greater than the round-trip time in typical data center networks, which is around 100 μs. Reducing RTOmin to avoid TCP Incast is reasonable because this large RTOmin imposes a huge throughput penalty: the transfer time of each data block is significantly smaller than RTOmin. The simulation results in [3] indicate that reducing the minimum RTO timer from the default 200 ms to 200 μs improves goodput by an order of magnitude. However, as the authors also point out, reducing RTOmin to 200 μs requires a TCP clock granularity of 100 μs according to the standard RTO estimation algorithm, and BSD and Linux TCP implementations are currently unable to provide such a fine-grained timer, so this is hard to implement. Moreover, reducing the RTOmin value might be harmful, especially where the servers also communicate with clients in the wide area network. The practical experiments in [1] also verified this idea: RTOmin values ranging from 1 to 200 ms were tested and compared, and the basic result is that smaller minimum RTO timer values yield larger goodput.
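The standard RTO estimator (RFC 6298) clamps the computed timeout at RTOmin, which is why the 200 ms floor dominates in a data center. A sketch with hypothetical RTT samples:

```python
ALPHA, BETA, K = 1/8, 1/4, 4   # RFC 6298 estimator gains

def rto(samples_s, rto_min_s):
    """Run the smoothed-RTT estimator over RTT samples, clamped at rto_min."""
    srtt, rttvar = samples_s[0], samples_s[0] / 2
    for r in samples_s[1:]:
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
        srtt = (1 - ALPHA) * srtt + ALPHA * r
    return max(rto_min_s, srtt + K * rttvar)

dc_rtts = [0.0001, 0.00012, 0.00009]     # ~100 us data center RTTs
print(rto(dc_rtts, rto_min_s=0.200))     # 0.2 s: the floor dominates (~2000x RTT)
print(rto(dc_rtts, rto_min_s=0.0002))    # ~0.00024 s: tracks the actual RTT
```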

3.2.1.2. Disabling the delayed ACK. The TCP delayed ACK mechanism attempts to reduce ACK traffic by having a receiver acknowledge only every other packet. If a single packet is received with none following, the receiver waits up to the delayed ACK timeout threshold before sending an ACK. The default minimum delayed ACK timeout in Linux is 40 ms. In an environment with a smaller RTO, such as a data center where the RTO may be below 40 ms, the sender can time out and incorrectly assume a loss has occurred before the delayed ACK arrives. The practical experiment results in [18]


show that the goodput with delayed ACKs disabled is slightly higher than with delayed ACKs enabled when the number of servers exceeds 8.
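On Linux, delayed ACKs can be suppressed per socket with the TCP_QUICKACK option; the kernel may clear the flag after some ACKs, so applications typically re-arm it around each receive. A minimal, Linux-specific sketch (the server address is hypothetical):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("10.0.0.1", 9000))   # hypothetical server address

while True:
    # TCP_QUICKACK is not sticky: re-enable it around each recv so ACKs
    # keep going out immediately instead of being delayed.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    data = sock.recv(65536)
    if not data:
        break
```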

3.2.1.3. Removing binary exponential backoff. Reference [19] points out that removing TCP's binary exponential backoff (BEB) can benefit throughput. Under severe congestion, more than 50% of timeouts can invoke BEB and stall the corresponding servers; the link is underutilized while waiting for the stalled servers to recover, causing a drastic drop in throughput. Moreover, the data center network differs from the classic Ethernet for which BEB was devised: packets are transferred by store-and-forward rather than broadcast, and applications on the nodes perform synchronized reads, meaning the nodes (client or servers) have only limited packets to send. Thus, BEB is not suitable in the data center.

Removing BEB can mitigate the Incast problem, resulting in a smaller dispersion of the response times with which the servers return their portions of data to the client [19]. The simulation results in [19] show that removing BEB does not advance the onset of Incast collapse; on the contrary, it can even be beneficial with larger SRU sizes and lower RTOmin.

However, Ref. [1] proposes similar solutions, a smaller multiplier for the RTO exponential backoff and a randomized multiplier for the RTO exponential backoff, and, in contrast, concludes that both are unhelpful in preventing the TCP Incast problem because only a small number of exponential backoffs occur during the entire transfer.

Although Ref. [1] claims that altering the exponential backoff behavior has little impact on mitigating Incast, it did not describe its NS simulation environment in enough detail to make the result convincing. In [19], removing BEB works well under severe congestion (more than 256 servers) but has little effect under mild and moderate congestion. How often severe congestion occurs during an entire transfer therefore needs further measurement.
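The effect being debated is easy to quantify. The sketch below compares cumulative waiting time over successive timeouts with and without the doubling, using the default RTOmin discussed above:

```python
RTO_MIN = 0.200   # seconds, the default minimum RTO

def total_wait(timeouts: int, multiplier: float) -> float:
    """Total time spent waiting across consecutive retransmission timeouts."""
    return sum(RTO_MIN * multiplier**i for i in range(timeouts))

for n in (1, 3, 5):
    print(f"{n} timeouts: BEB {total_wait(n, 2.0):.1f}s "
          f"vs no backoff {total_wait(n, 1.0):.1f}s")
# Five consecutive timeouts mean 6.2 s of waiting with doubling versus 1.0 s
# without, during which a stalled server leaves the bottleneck link idle.
```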

3.2.2. Enhanced Transmission Control Protocols. In recent years, there has been much research on TCP congestion control in different environments, especially wireless networks [20, 21] and fast long-distance networks [22]. Several TCP variants have also been designed specifically for data center networks. Among them, two notable enhanced TCP protocols are DCTCP (Data Center TCP) [7] and ICTCP (Incast Congestion Control for TCP) [8].

3.2.2.1. Data Center TCP. The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with the commodity shallow-buffered switches common in data center networks. To this end, DCTCP is designed to operate with small queue occupancies without loss of throughput, using Explicit Congestion Notification (ECN) in the network to provide explicit feedback to the end hosts.

The DCTCP algorithm has three main components:

(1) Simple marking at the switch side. There is a single parameter at the switch, the marking threshold K. An arriving packet is marked with the CE (Congestion Experienced) codepoint if the queue occupancy is greater than K upon its arrival; otherwise, it is not marked.

(2) ECN-Echo at the receiver side. The receiver ACKs every packet, setting the ECN-Echo flag if and only if the packet has a marked CE codepoint.

(3) Controller at the sender side. The sender maintains an estimate of the fraction of marked packets, called α, which is updated once for every window of data (roughly one RTT) as follows:

    α ← (1 − g) · α + g · F,    (8)

where F is the fraction of packets that were marked in the last window of data and 0 < g < 1 is the weight given to new samples. The sender then reduces its congestion window in proportion to α:

    cwnd ← cwnd · (1 − α/2).    (9)


Thus, when α is near 0 (low congestion), the window is only slightly reduced; when congestion is high (α = 1), DCTCP cuts the window in half, just like TCP.
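A compact sketch of the sender-side controller of equations (8) and (9). The gain g and the marking pattern are illustrative; real DCTCP lives in the kernel's congestion control module:

```python
G = 1 / 16   # weight for new congestion samples (illustrative)

class DctcpSender:
    def __init__(self, cwnd: float):
        self.cwnd = cwnd
        self.alpha = 0.0   # running estimate of the marked fraction

    def on_window_acked(self, marked: int, total: int) -> None:
        """Called once per window of data (roughly one RTT)."""
        f = marked / total                        # fraction marked this window
        self.alpha = (1 - G) * self.alpha + G * f # eq. (8)
        if marked:
            self.cwnd *= (1 - self.alpha / 2)     # eq. (9): graded window cut
        else:
            self.cwnd += 1                        # usual additive increase

s = DctcpSender(cwnd=100.0)
s.on_window_acked(marked=10, total=100)   # mild congestion: only a small cut
print(f"alpha={s.alpha:.3f}, cwnd={s.cwnd:.1f}")
```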

The authors of [23] developed a fluid model to analyze the throughput and delay of DCTCP mathematically. Their results show that DCTCP's throughput remains above 94% even as the threshold K goes to zero, much higher than the limiting throughput of 75% of a TCP source as the buffer size goes to zero. They also show that DCTCP converges no more than 1.4 times more slowly than TCP, and that in NS2 simulations the RTT-fairness of DCTCP is better than TCP with drop-tail but worse than TCP with RED. The authors note that their analysis covers the single-bottleneck case, and that understanding the behavior of DCTCP in general networks remains necessary future work.

Although DCTCP performed well in its authors' practical experiments, another study of data center congestion [24] found that a switch supporting ECN proved surprisingly difficult to find, so deploying the required hardware in common data centers may be problematic.

3.2.2.2. Incast Congestion Control for TCP. Unlike approaches that reduce the impact of Incast congestion through a fine-grained timeout value, ICTCP works at the receiver side, adjusting the TCP receive window proactively before packet losses occur.

The ICTCP algorithm has two main components:

(1) Control trigger by evaluating the available bandwidth. Assume the link capacity of the interface on the receiver server is C, and define the bandwidth of the total incoming traffic observed on that interface as BW_T. The available bandwidth BW_A on that interface is then

    BW_A = max(0, α · C − BW_T),    (10)

where α ∈ [0, 1] is a parameter to absorb potential oversubscribed bandwidth during window adjustment.

(2) Window adjustment per connection. The expected throughput of connection i is obtained as

    b_i^e = max(b_i^m, rwnd_i / RTT_i),    (11)
    d_b^i = (b_i^e − b_i^m) / b_i^e,    (12)

where b_i^m is the measured throughput, b_i^e is the expected throughput, and rwnd_i and RTT_i are the receive window and round-trip time of connection i, respectively. d_b^i is the ratio of the difference between the expected and measured throughput to the expected throughput for connection i. The window adjustment is shown in Table I.

The practical experiments in [8] demonstrated that ICTCP effectively avoids congestion, achieving almost zero timeouts for TCP Incast, and that it provides high performance and fairness among competing flows.

Table I. The ICTCP window adjustment [8].

    d_b^i ≤ γ1: Increase the receive window if there is enough quota of available bandwidth on the network interface; decrease the quota correspondingly if the receive window is increased.
    d_b^i > γ2: Decrease the receive window by one MSS if this condition holds for three continuous RTTs; the minimal receive window is 2 MSS.
    γ1 < d_b^i ≤ γ2: Keep the current receive window.

The authors chose to implement ICTCP in the TCP stack as a Network Driver Interface Specification (NDIS) driver on Windows. This approach can directly


support virtual machines, which are prevalent in data centers, but it has poor portability to other operating systems; this limitation also increases the difficulty of reproducing the experiment.
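A sketch of the per-connection decision of equations (11) and (12) and Table I. The thresholds γ1 and γ2 and the bandwidth-quota bookkeeping are illustrative placeholders; the control loop is simplified relative to [8]:

```python
GAMMA1, GAMMA2 = 0.1, 0.5   # throughput-difference thresholds (illustrative)
MSS = 1460                  # bytes

def adjust_rwnd(rwnd: int, rtt: float, measured_bps: float,
                quota_bps: float, over_for_3_rtt: bool) -> tuple[int, float]:
    """One ICTCP-style receive-window decision for a single connection."""
    expected_bps = max(measured_bps, rwnd * 8 / rtt)       # eq. (11)
    d_b = (expected_bps - measured_bps) / expected_bps     # eq. (12)

    if d_b <= GAMMA1 and quota_bps > 0:
        # The connection uses its window well: grow it and charge the quota.
        rwnd += MSS
        quota_bps -= MSS * 8 / rtt
    elif d_b > GAMMA2 and over_for_3_rtt:
        # Persistently over-provisioned window: shrink it, floor at 2 MSS.
        rwnd = max(2 * MSS, rwnd - MSS)
    return rwnd, quota_bps

rwnd, quota = adjust_rwnd(rwnd=16 * 1460, rtt=0.0001,
                          measured_bps=9e8, quota_bps=5e7,
                          over_for_3_rtt=False)
print(rwnd, quota)
```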

    3.3. Application layer

At the application layer, the main idea is to reduce packet loss. A common approach is to restrict the number of servers participating in a synchronized data transfer. Several specific proposals follow.

3.3.1. Increasing the server request unit size. The probability of Incast events increases with the number of servers engaged in a synchronized data transfer to any one client. To counter this, clients should request more data from fewer servers. The NS2 simulation results in [3] illustrate that increasing the SRU size can improve the overall goodput: for a fixed data block size, a larger SRU means the block is striped across fewer servers, so clients can request the same amount of data from a smaller number of servers and avoid Incast. Moreover, with a larger SRU size, servers use the spare link capacity made available by any stalled flow waiting for a timeout event.

However, striping data across fewer servers is contrary to the design of cluster-based storage systems, where data are stored across many storage servers to improve both reliability and performance. The storage system must also allocate pinned space in client kernel memory to handle the larger SRUs; this increases memory pressure, which may lead to client kernel failures in file system implementations.

3.3.2. Staggering data transfers. Staggering data transfers is another way to limit the number of synchronously communicating servers; the staggering effect can be produced either by the clients or at the servers [5].

A client can stagger a transfer by requesting only a subset of the total block fragments at a time, or by requesting from a subset of the servers, so that only a limited number of servers respond synchronously.

Servers can delay their responses randomly or deterministically. A server can wait an explicitly pre-assigned time or a random period before beginning its data transfer, thus limiting the number of servers participating in a parallel transfer. Servers can also respond to other requests and prefetch data into their caches during the delay.

Although staggering data transfers is theoretically an attractive way to prevent Incast, so far there are no sufficient experimental data to prove its efficiency; its performance needs to be tested and verified in further research and experiments.

3.3.3. Global scheduling of data transfers. Global scheduling of data transfers is needed when a client simultaneously runs several workloads and makes multiple requests to different subsets of servers. Such scheduling can restrict the total number of servers simultaneously responding to any client, thus avoiding Incast. One instantiation of this idea uses SRU tokens: a server cannot transfer data to a client unless it holds that client's SRU token, and while waiting for the right token it can prefetch data into its cache. Reference [5] presents this idea but reports no experiments.

The storage system may learn that, to avoid Incast, it is safe to send data to a given client from only k servers; the system would then create k SRU tokens for each client in a global token pool. Each client can send requests to all servers containing the data it wants, but only the servers holding that client's tokens may transfer the data. This restriction ensures that only k servers simultaneously send data to any given client.
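A minimal sketch of such a per-client token pool in Python. The value of k, the client names, and the non-blocking retry policy are hypothetical, since [5] specifies no implementation:

```python
import threading

class SruTokenPool:
    """Global pool granting at most k concurrently sending servers per client."""

    def __init__(self, k: int):
        self._k = k
        self._slots = {}              # client id -> semaphore holding k tokens
        self._lock = threading.Lock()

    def _sem(self, client: str) -> threading.Semaphore:
        with self._lock:
            return self._slots.setdefault(client, threading.Semaphore(self._k))

    def acquire(self, client: str) -> bool:
        # A server calls this before transferring an SRU; if no token is free
        # it should keep prefetching into its cache and retry later.
        return self._sem(client).acquire(blocking=False)

    def release(self, client: str) -> None:
        self._sem(client).release()

pool = SruTokenPool(k=4)
granted = [pool.acquire("client-1") for _ in range(6)]
print(granted)   # [True, True, True, True, False, False]
```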

However, it is not easy to find the optimal value of k. The storage system might obtain it either through manual configuration or through real-time system measurements. The former may require many experiments, and when the environment changes, a new configuration


needs to be carried out manually again; the latter requires an efficient measurement model to ensure that the storage system can obtain the real-time, optimal value of k.

    3.4. Other proposals

In addition to the proposals at each of the above layers, there are also proposals that consider the physical layer, the transfer pattern, or the data handling method.

3.4.1. Larger switch buffer. To address the primary cause of Incast timeouts, packet losses, the buffer allocated at the Ethernet switch can be increased. With a large enough buffer, Incast can be avoided for a limited number of servers: the simulation results in [3] show that doubling the switch's output buffer doubles the number of servers that can be supported before the system experiences Incast.

Unfortunately, as the number of servers grows, a switch with an ever larger buffer is needed, and that costs much more (the Force10 E1200 switch [23], which has very large buffers, costs over $500,000). The system designer thus faces a difficult choice between cost and over-provisioning.

3.4.2. Probabilistic retransmission. Another practical technique against TCP Incast is to reduce the time spent detecting congestion rather than to avoid timeouts. The technique relies on probabilistic retransmissions, a kernel thread, and duplicate ACKs.

The algorithm in [25] works as follows. First, a kernel thread retransmits the highest unacknowledged segment in the sender's transmission window with probability P, marking the retransmission in one of the six reserved bits of the segment header. Second, depending on which of the original and retransmitted segments it has received, the receiver returns a normal ACK followed by dupackthresh duplicate ACKs (the duplicate ACK threshold) as congestion feedback to the sender. Third, on receiving dupackthresh duplicate ACKs in a row, the sender automatically enters Fast Retransmit without waiting for a retransmission timeout.

The simulation results in [25] show that the algorithm performs well, with advantages over both default TCP (RTO = 200 ms) and modified TCP (RTO = 200 μs). How to choose P, however, remains an open question: if P is set too low, the technique provides no significant benefit, while if it is set too high, it causes unnecessary retransmissions that add to the congestion at the switch [25]. In addition, the paper does not report performance with more than 64 servers, which also needs consideration.

3.4.3. Changing the file striping pattern. In addition to excessive servers participating in a synchronized data transfer, disk head contention among clients caused by access to popular files on I/O servers can also lead to throughput collapse. The authors of [26] proposed a new file striping strategy called storage server grouping (SSG), which changes the file striping pattern across the storage servers based on an analysis of file popularity and of the impact of the number of storage servers on the clients' perceived performance (I/O speedup), to reduce the negative effects of Incast.

SSG is a framework that automatically changes file striping parameters, such as the striping unit size, the striping factor (the number of storage servers a file is striped across), and the striping index (the first storage server of the stripe), in an online manner. It uses an I/O speedup model to find the optimal number of storage servers before a file is striped across them; the model is trained with a relative fitness machine-learning technique [27] to correlate the number of storage servers with the I/O performance of a workload. SSG keeps tracking file popularity and intelligently separates files into different server groups by setting the striping index, reducing data access interference within each group. SSG also periodically tunes the file striping parameters based on I/O workload characteristics profiled online. The authors implemented the SSG scheme on top of a parallel file system,


Table II. Comparative analysis of various proposals for TCP Incast.

Link layer: congestion control [6, 10, 13-15]. Efficiency: QCN can effectively control the link rate, but performs poorly in Incast [13]. Cost: high; needs special switch support.

Link layer: flow control [3, 16, 17]. Efficiency: EFC is effective in a simple configuration, but does not work well with multiple switches. Cost: high; a lossless Ethernet needs new standards implemented in switches.

Transport layer: modification of TCP parameters [1, 18, 19]. Efficiency: mitigates the Incast problem slightly. Cost: low; just modifies TCP parameters.

Transport layer: enhanced TCP protocols [7, 8, 24]. Efficiency: DCTCP performs much better than TCP in the Incast problem when there are fewer than 35 concurrent senders [7]; ICTCP effectively avoids congestion, achieving almost zero timeouts for TCP Incast [8]. Cost: high; DCTCP needs special switches supporting ECN and changes to the TCP source code, and ICTCP needs changes to the TCP stack.

Application layer: increasing SRU size [3]. Efficiency: the overall goodput can be improved. Cost: may cause memory pressure; needs further experiments and research.

Application layer: staggering data transfer [5]. Efficiency: can theoretically avoid Incast, but lacks experimental or practical data to prove its efficiency. Cost: needs further experiments.

Application layer: global scheduling of data transfers [5]. Efficiency: can effectively ensure throughput if the optimal value k can be obtained in real time. Cost: high if configured manually; lower if an efficient measurement model is applied.

Others: larger switch buffer [3]. Efficiency: Incast can be avoided for a limited number of servers. Cost: high; larger switch buffers are expensive.

Others: probabilistic retransmission [25]. Efficiency: effective in avoiding Incast with a proper probability P. Cost: low; needs a kernel thread to implement.

Others: changing the file striping pattern [26-28]. Efficiency: SSG can improve system-wide I/O throughput by up to 38.6% and by 22.1% on average. Cost: high if the popularity of files changes frequently; low if it rarely changes.


called Lustre [28]. Their experimental results show that SSG can improve system-wide I/O throughput by up to 38.6%, and by 22.1% on average.

However, the SSG scheme may have adaptability and usability problems. It is effective if the popularity of files rarely changes; the cost may be high if file popularity changes substantially and/or frequently, because each such change requires the striping pattern to be reconfigured accordingly.

    4. COMPARISON

In the section above, the main proposals for solving the TCP Incast issue in a data center network have been described and analyzed. Although there seem to be many solutions, their efficiency and implementation cost vary considerably. Table II compares these proposals.

    5. CONCLUSION

In this paper, the methods of solving the TCP Incast issue in data center networks have been summarized. Following the two general methods, reducing packet loss and quick recovery, the existing proposals were described and analyzed layer by layer. Although many solutions have been proposed, there is no perfect one yet; many proposals are still immature and need further research and experimentation. Beyond these proposals, designing novel and suitable network architectures for data centers to improve their performance is also a hot research topic [29-31].

    ACKNOWLEDGEMENTS

This work was supported in part by the Knowledge Innovation Program of the Chinese Academy of Sciences under Grant No. CNIC_QN_1203. The authors thank the editors and reviewers for their earnest and helpful comments and suggestions.

    REFERENCES

1. Chen Y, Griffith R, Liu J, Katz RH, Joseph AD. Understanding TCP Incast throughput collapse in datacenter networks. In Proceedings of the 1st ACM Workshop on Research on Enterprise Networking, NY, USA, 2009; 73-82.

2. Wu W, Crawford M. Potential performance bottleneck in Linux TCP. International Journal of Communication Systems November 2007; 20(11):1263-1283.

3. Phanishayee A, Krevat E, Vasudevan V, Andersen DG, Ganger GR, Gibson GA, Seshan S. Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08), San Jose, CA, February 2008; 26-29.

4. Ding Z, Guo D, Liu X, Luo X, Chen G. A MapReduce-supported network structure for data centers. Concurrency and Computation: Practice and Experience 2011. DOI: 10.1002/cpe.1791.

5. Krevat E, Vasudevan V, Phanishayee A, Andersen DG, Ganger GR, Gibson GA, Seshan S. On Application-level Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems. In Proceedings of the 2nd International Petascale Data Storage Workshop (PDSW '07), NY, USA, November 2007; 1-4.

6. Alizadeh M, Atikoglu B, Kabbani A, Lakshmikantha A, Pan R, Prabhakar B, Seaman M. Data Center Transport Mechanisms: Congestion Control Theory and IEEE Standardization. In Proceedings of the 46th Annual Allerton Conference, Illinois, USA, September 2008; 1270-1277.

7. Alizadeh M, Greenberg A, Maltz D, Padhye J, Patel P, Prabhakar B, Sengupta S, Sridharan M. Data Center TCP (DCTCP). In Proceedings of ACM SIGCOMM, NY, USA, September 2010.

8. Wu H, Feng Z, Guo C, Zhang Y. ICTCP: Incast Congestion Control for TCP in Data Center Networks. In Proceedings of ACM CoNEXT, NY, USA, 2010.

9. Valarmathi K, Malmurugan N. Distributed multichannel assignment with congestion control in wireless mesh networks. International Journal of Communication Systems 2011; 24:1584-1594. DOI: 10.1002/dac.1234.

10. 802.1Qau - Congestion Notification. Available from: http://www.ieee802.org/1/pages/802.1au.html [last accessed: 13 October 2011].

11. Zhang C, Mamatas L, Tsaoussidis V. A study of deploying smooth- and responsive-TCPs with different queue management schemes. International Journal of Communication Systems May 2009; 22(5):513-530.

12. 802.1Qbb - Priority-based Flow Control. Available from: http://www.ieee802.org/1/pages/802.1bb.html [last accessed: 16 October 2011].


13. Devkota P, Reddy ALN. Performance of Quantized Congestion Notification in TCP Incast Scenarios of Data Centers. In Proceedings of the 18th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Miami Beach, FL, 2010.

14. Kabbani A, Alizadeh M, Yasuda M, Pan R, Prabhakar B. AF-QCN: Approximate Fairness with Quantized Congestion Notification for Multi-tenanted Data Centers. In Proceedings of the 18th IEEE Annual Symposium on High Performance Interconnects (HOTI), Mountain View, CA, August 2010.

15. Zhang Y, Ansari N. On Mitigating TCP Incast in Data Center Networks. In Proceedings of IEEE INFOCOM, Shanghai, China, 2011.

16. Wadekar M. Enhanced Ethernet for Data Center: Reliable, Channelized and Robust. In Proceedings of the 15th IEEE Workshop on Local and Metropolitan Area Networks, NY, USA, June 2007; 65-71.

17. Cisco white paper. Priority Flow Control: Build Reliable Layer 2 Infrastructure, June 2009.

18. Vasudevan V, Phanishayee A, Shah H, Krevat E, Andersen DG, Ganger GR, Gibson GA, Mueller B. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In Proceedings of ACM SIGCOMM '09, Barcelona, Spain, August 2009.

19. Zheng H, Chen C, Chunming C, Qiao AC. Understanding the Impact of Removing TCP Binary Exponential Backoff in Data Centers. In Proceedings of the Third International Conference on Communications and Mobile Computing, Qingdao, China, April 2011; 174-177.

20. Ko E, An D, Yeom I, Yoon H. Congestion control for sudden bandwidth changes in TCP. International Journal of Communication Systems 2011. DOI: 10.1002/dac.1322.

21. Hou T-C, Hsu C-W, Wu C-S. A delay-based transport layer mechanism for fair TCP throughput over 802.11 multihop wireless mesh networks. International Journal of Communication Systems 2011; 24:1015-1032. DOI: 10.1002/dac.1207.

22. Ren YM, Tang HN, Li J, Qian HL. Transport protocols for fast long distance networks. Journal of Software 2010; 21(7):1576-1588.

23. Force10 E1200 switch. Available from: http://www.nasi.com/force10_e1200.php.

24. Stewart R, Tüxen M, Neville-Neil GV. An Investigation into Data Center Congestion with ECN. In Proceedings of the 2011 Technical BSD Conference (BSDCan 2011), Ottawa, Canada, May 2011.

25. Kulkarni S, Agrawal P. A Probabilistic Approach to Address TCP Incast in Data Center Networks. In Proceedings of the ICDCS Workshops, MN, USA, June 2011; 26-33.

26. Zhang X, Liu G, Jiang S. Improve Throughput of Storage Cluster Interconnected with a TCP/IP Network Using Intelligent Server Grouping. In Proceedings of the 2010 IFIP International Conference on Network and Parallel Computing, September 2010.

27. Mesnier MP, Wachs M, Sambasivan RR, Zheng AX, Ganger GR. Modeling the Relative Fitness of Storage. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, NY, USA, June 2007.

28. Sun Microsystems, Inc. Lustre: A Scalable, High-Performance File System. Available from: http://www.lustre.org, 2009.

29. Guo C, Wu H, Tan K, Shi L, Zhang Y, Lu S. DCell: A Scalable and Fault-tolerant Network Structure for Data Centers. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, NY, USA, 2008; 75-86.

30. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, NY, USA, 2009; 63-74.

31. Greenberg A, Jain N, Kandula S, Kim C, Lahiri P, Maltz D, Patel P, Sengupta S. VL2: A Scalable and Flexible Data Center Network. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, NY, USA, October 2009.


    Copyright 2012 John Wiley & Sons, Ltd. Int. J. Commun. Syst. (2012)

    DOI: 10.1002/dac

  • 5/25/2018 2012 a Survey on TCP Incast in Data Center Network-libre

    13/19

    Author Query Form

    Journal: International Journal of Communication Systems

    Article: dac_2402

    Dear Author,

    During the copyediting of your paper, the following queries arose. Please respond to these by annotating

    your proofs with the necessary changes/additions.

    If you intend to annotate your proof electronically, please refer to the E-annotation guidelines.

    If you intend to annotate your proof by means of hard-copy mark-up, please refer to the proof mark-

    up symbols guidelines. If manually writing corrections on your proof and returning it by fax, do

    not write too close to the edge of the paper. Please remember that illegible mark-ups may delay

    publication.

    Whether you opt for hard-copy or electronic annotation of your proofs, we recommend that you provide

    additional clarification of answers to queries by entering your answers on the query sheet, in additionto the text mark-up.

    Query No. Query Remark

    Q1 AUTHOR: Please provide a suitable figure (abstract dia-gram or illustration selected from the manuscript or anadditional eye-catching figure) and a short GTOCabstract (maximum 80 words or 3 sentences) summarizingthe key findings presented in the paper for Table of Content(TOC) entry.

    Q2 AUTHOR: Transmission Control Protocol. Is this the cor-rect definition of TCP? Please change if incorrect.

    Q3 AUTHOR: a deep was changed to an in-depth. Pleasecheck if this is correct.

    Q4 AUTHOR: Transmission Control Protocol. Is this the cor-rect definition of TCP? Please change if incorrect.

    Q5 AUTHOR: Please check all section heading levels if cor-rect.

    Q6 AUTHOR: Please spell out CoSs.

    Q7 AUTHOR: Please provide developer name, city, state (ifUS), country for NS.

    Q8 AUTHOR: Please spell out CE.

    Q9 AUTHOR: Please spell out CE.

    Q10 AUTHOR: Please confirm if renumbering of equationsis OK as equation 8 was missing/skipped in the originalmanuscript.

    Q11 AUTHOR: Please provide developer name, city, state (ifUS), country for Windows.

  • 5/25/2018 2012 a Survey on TCP Incast in Data Center Network-libre

    14/19

    Q12 AUTHOR: Please check if changes to the first sentence ofSection 3.4.1 are correct.

    Q13 AUTHOR: All occurrences of striping and sripedchanged to stripping and stripped, respectively. Pleasecheck if this is correct.

    Q14 AUTHOR: Internet Protocol. is this the correct definitionof IP? Please change if incorrect.

    Q15 AUTHOR: Please provide page range and volume numberfor Reference 4.

    Q16 AUTHOR: Please provide page range for References 78,1315, 18, 24, 2627, 31.

    Q17 AUTHOR: Please provide accessed date for References17, 23, 28.

    Q18 AUTHOR: Please provide author biographies.

  • 5/25/2018 2012 a Survey on TCP Incast in Data Center Network-libre

    15/19
