


Telecommunication Systems (2020) 73:95–104
https://doi.org/10.1007/s11235-019-00601-8

Providing overlay-based multicast in data center networks with optional millimeter wavelength links

Yi Wang1 · Bingquan Wang2 · Chen Tian2 · Fu Xiao3

Published online: 26 July 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
Efficient multicast support is very important in data center (DC) networks, as many big data systems in data centers are bandwidth-hungry. Multicast implementations in DC networks are usually overlay-based. The major challenges are how to accurately infer the topology of DC networks with both wired and wireless millimeter wavelength (MMW) links, and how to design a multicast algorithm for these topologies. In this paper, we first use hierarchical clustering to accurately infer the topology. Then, we propose the Inter-Rack First Multicast (IRFM) algorithm to match the fan-in nature of DC topologies. Evaluation results demonstrate that IRFM is 3.7−11.2× faster than naive multicast implementations in the pure wired case, and 4.8−14.6× faster in the case of MMW enhancement.

Keywords Data center networks · Millimeter wave · 60 GHz wireless · Multicast

1 Introduction

Efficient multicast (i.e., one-to-many communication) support is very important in data center (DC) networks [1–3]. In today's big data systems, more and more applications need to multicast massive amounts of data to other nodes. These applications include publish-subscribe services for data dissemination [4], web cache updates [5], scatter-gather in iterative optimization algorithms [6], and fragment-replicate joins in Hadoop [7]. Therefore, an efficient multicast algorithm can significantly improve the performance of these applications.

Chen Tian (✉)
[email protected]

Yi Wang
[email protected]

Bingquan Wang
[email protected]

Fu Xiao
[email protected]

1 School of Modern Posts, Nanjing University of Posts and Telecommunications, Nanjing 210046, China

2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

3 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210046, China

Multicast implementations in DC networks are usually overlay-based. As we know, L2 or L3 multicast support is usually disabled for management reasons in DC networks [8]. So, we target an overlay-based multicast algorithm. An intuitive option is to mimic BitTorrent-like protocols [9], which accelerate the dissemination process by splitting the data into chunks and performing chunk-level scheduling [10]. This approach severely increases CPU load that could otherwise be used for more important computations in the DC [11]. Instead, we transfer the data as a whole from one node to another.

There are two major challenges. First, how can we accurately infer the topology of DC networks? The topology of DC networks is very important for overlay-based multicast algorithms. Common wired architectures are trees [12], leaf-spine [13], fat-trees [14], and Clos networks [15,16]. However, millimeter wavelength wireless technology (MMW) [17–19] has been used in recent years to further accelerate DC networks [20–22]. These wireless links change constantly, steered by Micro-Electro-Mechanical Systems (MEMS). As a result, the network topology becomes so dynamic that accurately inferring it is a challenge. Second, how do we design multicast algorithms for these topologies?

In this paper, we target overlay-based multicast for big data systems. The scheduling objective of our multicast algorithm is to minimize the average multicast completion time (MCCT). We make three contributions.


First, we use hierarchical clustering to accurately infer the topology of DC networks, even in the case of a hybrid of wired and wireless links. Second, we propose the Inter-Rack First Multicast (IRFM) algorithm to match the fan-in nature of DC topologies and derive optimal parameter settings for two different scenarios. Third, we conduct large-scale NS3 simulations to evaluate the performance of our algorithm. The results show that our multicast scheme (IRFM) is 3.7−11.2× faster than naive multicast implementations in the pure wired case, and 4.8−14.6× faster in the case of millimeter wavelength wireless technology enhancement.

The rest of this paper is organized as follows. Section 2 discusses background and related work. Section 3 presents the design of our system. Then we discuss topology inference and Inter-Rack First Multicast (IRFM) in Sects. 4 and 5, respectively. Finally, we conclude in Sect. 6.

2 Background and intuition

2.1 60 GHz links in DC networks

In recent years, millimeter wavelength wireless technology (MMW) has gradually matured. The spectrum between 57 and 64 GHz, colloquially known as the 60 GHz band, is available worldwide for unlicensed use [23]. More and more DCs adopt 60 GHz wireless technologies [24–26]. 60 GHz wireless links can provide multi-Gbps data rates for DCs. Furthermore, 60 GHz devices with directional antennas can be deployed densely, which suits DC networks, because the signal attenuates rapidly due to the high frequency [23]. Most importantly, these wireless links have low power consumption and adjustable direction [27,28]. Wireless links not only eliminate much of the hassle of wiring, but are also more flexible [29]. MMW communication technology is complementary to traditional wired DC networks [30].

2.2 Overlay-based multicast in DC networks

Li et al. [31] focus on a routing strategy to prevent multicast loops. However, our goal is to minimize the average multicast completion time (MCCT). We also need to take note of the fact that most DC operators do not enable multicast features in their networks for scalability and stability reasons.

Orchestra manages data transfers in computer clusters [32]. For multicast transfers, Orchestra implements an optimized BitTorrent-like protocol called Cornet, augmented by an adaptive clustering algorithm to take advantage of the hierarchical network topology in many data centers. A BitTorrent-like protocol splits data into blocks and subdivides blocks into small chunks. This approach burdens the CPU when transferring large data simultaneously. In addition,

Fig. 1 System architecture

Orchestra did not consider the architecture of wired and wireless hybrid DC networks, in which the changing wireless antenna direction may seriously affect the network topology.

2.3 Other related work

FreeFlow [33] implements a container networking library to make communication more efficient no matter which network API set is used. The authors consider the scenario where both sender and receiver containers reside on the same host. FreeFlow uses shared memory to accelerate data transmission; shared memory is a particularly efficient way to transfer data between two nodes on the same host. Although the multicast source node and a destination node may be on the same host during the multicast process, in this paper we only consider senders and receivers residing on different hosts and leave the above scenario to future work.

Although the topologies vary with the orientation of the MMW wireless antennas in wired and wireless hybrid DCs, we can still infer almost real-time topology information. Based on the topology of DC networks, we target a more efficient overlay-based multicast.

3 System design

To manage and optimize data transfers in the whole multicast process, we design the multicast system as shown in Fig. 1. The key idea of the system is to separate the multicast control plane from the data plane, with a multicast master control node and slave data transmission nodes. Each slave node has a multicast agent. The multicast API is Bool multicast(source node, multicast group, data). An application simply calls the API, and the multicast agent sends the request to a centralized multicast master node. The master node has a topology inference module and a multicast scheduler module. Since the direction of the MMW wireless antenna changes constantly with the DC workload, the topology inference module periodically infers the DC network topology from historical transfer throughput among nodes (Sect. 4.1).
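As a concrete illustration, the control-plane interface can be sketched as follows. This is only a minimal sketch: the class names, the data_id field, and the master.submit() call are illustrative assumptions, not the authors' implementation; the paper only specifies the Bool multicast(source node, multicast group, data) API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MulticastRequest:
    source: str        # multicast source node
    group: List[str]   # multicast destination nodes
    data_id: str       # identifier of the data transferred as a whole

class MulticastAgent:
    """Runs on every slave node; applications simply call multicast() on it."""
    def __init__(self, master):
        self.master = master  # stub handle to the centralized multicast master

    def multicast(self, source: str, group: List[str], data_id: str) -> bool:
        # Forward the request to the master node, whose topology inference
        # module (Sect. 4) and scheduler module (Sect. 5) drive the transfers.
        return self.master.submit(MulticastRequest(source, group, data_id))
```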


The scheduler coordinates the whole multicast process. Based on the network topology, it instructs the source node to send all data to one or several destination nodes first. When a destination node has received all the data, it becomes a seed. The scheduler then instructs the seed to transfer data to other destination nodes on behalf of the original source. When the source or a seed finishes one transfer, it might be scheduled with another destination, and so on, until all destination nodes have received the data. The scheduling algorithm is described in Sect. 5.

4 Topology inference

4.1 Design philosophy

Many data centers employ hierarchical network topologies with oversubscription [15]. For the same amount of data, the transfer time between two nodes in different racks (i.e., inter-rack communication) is much higher than that between two nodes in the same rack (i.e., intra-rack communication).

The MMW wireless antennas on top of the racks can alleviate hot spots between racks to a certain extent, because MMW wireless links increase the bandwidth between the racks. However, MMW wireless links do not affect the wired topology of the DC network. In this paper, we call the wired topology the physical topology (rack level). Without loss of generality, assume there is a wireless link between rack 1 and rack 2; then the bandwidth between rack 1 and rack 2 is increased. From a hierarchical topology perspective, these two racks can be viewed as a super-rack. We can use hierarchical clustering to accurately infer the physical topology (rack level) and the super-rack topology (super-rack level).

With topology information, the scheduler module can coordinate the multicast process more efficiently [34]. For this reason, we implement a topology inference module, similar to previous work [32], but we add hierarchical clustering to adapt to the MMW wireless environment between racks.

The steps for topology inference are as follows:

– We use (historical) node-to-node transfer throughput records to construct an n × n sparse distance matrix D, where n is the number of nodes in the topology, and the entries are the median transfer throughput between two nodes [35,36].

– Missing entries in the distance matrix are inferred using the non-negative matrix factorization procedure of Mao and Saul [37]. Non-negative matrix factorization (NMF) is a linear dimensionality reduction that can be applied to the distance matrix D.

– After completing the matrix D, we project the nodes onto a two-dimensional space using non-metric multidimensional scaling (NMDS) [38]. The goal of NMDS is to collapse information from multiple dimensions into just a few, so that they can be visualized and interpreted.

Fig. 2 The simulated data center topology (racks 1–5, 16 hosts each)

Fig. 3 The classification of the nodes in the data center networks without background flows, from the perspective of the rack level


– Finally, we cluster nodes using a mixture of spherical Gaussians. By setting different thresholds, we can obtain the physical topology (rack level) and the super-rack topology (super-rack level).

With enough training data, the above topology inference algorithm can infer the network topology accurately, as we will show in Sect. 4.2. We expect this implementation can be easily extended to topologies with more than two switch layers.
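To make the pipeline concrete, the sketch below shows one possible realization in Python, assuming scikit-learn stand-ins for the three steps (the paper itself uses the NMF procedure of Mao and Saul [37], R's isoMDS, and mclust); the factorization rank and all function names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.manifold import MDS
from sklearn.mixture import GaussianMixture

def infer_topology(throughput, n_racks, n_super_racks):
    """throughput: n x n matrix of median node-to-node transfer throughput,
    with np.nan where a pair has not been measured."""
    # 1. Complete the sparse matrix with non-negative matrix factorization.
    observed = ~np.isnan(throughput)
    filled = np.where(observed, throughput, np.nanmean(throughput))
    nmf = NMF(n_components=4, init="random", max_iter=500, random_state=0)
    D = nmf.fit_transform(filled) @ nmf.components_  # low-rank reconstruction
    D[observed] = throughput[observed]               # keep measured entries

    # 2. Project nodes onto 2-D with non-metric multidimensional scaling.
    dissim = D.max() - D                   # higher throughput -> closer nodes
    dissim = (dissim + dissim.T) / 2.0     # symmetrize for MDS
    np.fill_diagonal(dissim, 0.0)
    coords = MDS(n_components=2, metric=False, dissimilarity="precomputed",
                 random_state=0).fit_transform(dissim)

    # 3. Cluster with mixtures of spherical Gaussians; the two component
    #    counts stand in for the two thresholds (rack vs. super-rack level).
    racks = GaussianMixture(n_components=n_racks, covariance_type="spherical",
                            random_state=0).fit_predict(coords)
    super_racks = GaussianMixture(n_components=n_super_racks,
                                  covariance_type="spherical",
                                  random_state=0).fit_predict(coords)
    return racks, super_racks
```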

4.2 Topology inference evaluation

The simulated topology we adopt is shown in Fig. 2. There are 80 hosts in total, with 16 hosts in each rack. All 5 rack switches connect to 1 spine switch. Also, there is a directional antenna on the top of every rack, which can provide 5 Gbps bandwidth on a 60 GHz carrier wave. Every wired link has a bandwidth of 10 Gbps and the link delay is 1 µs.


Fig. 4 The classification of the nodes in the data center networks with 80 random background flows, from the perspective of the rack level

Fig. 5 The classification of the nodes in the data center networks without background flows, from the perspective of the super-rack level

The direction of the MMW wireless antenna varies with the DC workload, so it is beyond our control.

Without loss of generality, we assume that rack 1 and rack 2, and rack 3 and rack 4, are connected by wireless links. Rack 5 does not turn on wireless. We randomly generate 80 × 80 × 0.5 data flows, and the size of each flow is 1 MB. We record the transfer time of each flow, so that matrix D is 50% full.

Fig. 6 The classification of the nodes in the data center networks with 80 random background flows, from the perspective of the super-rack level

As mentioned before, we infer the missing data in the matrix using non-negative matrix factorization. We project the nodes onto a two-dimensional space [38] using the isoMDS function of the MASS library in R. We then use mclust, an R package for normal mixture modeling via EM, model-based clustering, classification, and density estimation [39,40].

The results, without and with concurrent background flows, are shown in Figs. 3 and 4, respectively. The ellipses in the figures represent the inferred clusters, and the different shapes represent different racks. From Figs. 3 and 4, we can see that our topology inference algorithm can accurately infer the rack-level network topology. Furthermore, as shown in Figs. 5 and 6, the two racks (rack 1 and rack 2) in the bottom left corner of the figure are classified into one cluster and regarded as a super-rack by raising the threshold for higher-level clustering. The other two racks (rack 3 and rack 4) in the top right corner also form a super-rack, and rack 5 in the bottom right corner is a super-rack on its own. Through hierarchical clustering, we can obtain the topologies of different levels in the DC network, which reflect the MMW wireless links between the racks. From Figs. 5 and 6, we can see there are 3 super-racks in the topology, compared to the 5 racks in Figs. 3 and 4.

We can conclude that, with or without background flows, the algorithm can accurately infer the physical topology and the super-rack topology of the network. Since the nodes in a super-rack are very close to each other, a super-rack is treated as a whole and simply called a rack in the following.


Fig. 7 The Inter-Rack First Multicast (IRFM) algorithm

5 Multicast algorithm

5.1 Inter-Rack First Multicast

Inter-Rack First Multicast (IRFM) leverages the fact that the transfer time between two nodes in different racks is significantly higher than between nodes in the same rack. At the beginning of a multicast transmission, IRFM first transmits data to nodes in other racks. This approach lets each rack obtain a copy of the data within the shortest time. Then the data can be transferred inside each rack, which not only increases parallelism and greatly reduces MCCT, but also frees up the bandwidth between racks for more valuable work.

We also need to fully utilize the bandwidth of uploading nodes, as inter-rack links can be congested. We propose two solutions to this problem. First, we use the MMW wireless antennas on top of the racks to increase the bandwidth between the racks. Second, when we perform inter-rack transfers, we should transfer data to intra-rack nodes at the same time. This scheme exploits the remaining upload bandwidth of the source and seeds.

The Inter-Rack First Multicast (IRFM) algorithm is illustrated in Fig. 7, and the details of IRFM are shown in Algorithm 1.

Algorithm 1 Inter-Rack First Multicast
Step 1. At the beginning of the multicast transmission, the multicast source node (blue) transfers data to m intra-rack (orange, in the same super-rack) and n inter-rack (orange, in a different super-rack) destination nodes at the same time.
Step 2. As soon as a node has completed the data transmission, it becomes a seed and performs the same action as Step 1, until all racks have at least one copy of the data.
Step 3. The multicast source node and all seeds transfer data only to intra-rack nodes until the whole multicast process is completed.
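Algorithm 1 can be read as an event-driven scheduler running on the master. The sketch below is illustrative only: the callback interface, the rack_of mapping, and the tie-breaking among pending destinations are our assumptions, not the authors' implementation.

```python
def irfm_scheduler(source, destinations, rack_of, m, n):
    """Return two callbacks: next_transfers() picks receivers for an idle seed,
    on_complete() registers a destination that has received all the data."""
    seeds = {source}                  # nodes already holding the full data
    pending = set(destinations)       # destinations still waiting for the data
    covered = {rack_of[source]}       # racks holding at least one copy
    all_racks = {rack_of[d] for d in destinations} | {rack_of[source]}

    def next_transfers(idle_node):
        if idle_node not in seeds or not pending:
            return []
        if covered != all_racks:
            # Steps 1-2: inter-rack first -- up to n receivers in uncovered
            # racks, plus up to m intra-rack receivers to use spare uplink.
            inter = [d for d in pending if rack_of[d] not in covered][:n]
            intra = [d for d in pending - set(inter)
                     if rack_of[d] == rack_of[idle_node]][:m]
            chosen = inter + intra
        else:
            # Step 3: every rack has a copy; only intra-rack transfers remain.
            chosen = [d for d in pending
                      if rack_of[d] == rack_of[idle_node]][:m]
        for d in chosen:
            pending.discard(d)
        return chosen

    def on_complete(receiver):
        seeds.add(receiver)           # the receiver now acts as a seed (Step 2)
        covered.add(rack_of[receiver])

    return next_transfers, on_complete
```

The master would call next_transfers() whenever the source or a seed becomes idle, and on_complete() whenever a destination finishes receiving the data.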

5.2 Guidelines for choosing parameters

In this section, we discuss parameter selection under different conditions. The first is Exclusive, where a single multicast can use the whole network bandwidth; there are no concurrent multicast sessions and no background flows. This scenario is for analysis and comparison purposes only.

Table 1 Different multicast algorithms

– Collateral multicast (CM): the multicast source node transfers data to all multicast destination nodes simultaneously.
– Sequential multicast (SM): the multicast source node transfers data to the multicast destination nodes sequentially.
– Random multicast (RM): the multicast source node transfers data to a randomly chosen destination node; whenever a node (source or destination) completes a transfer, it sends the data to another randomly chosen node.
– Inter-Rack First Multicast (IRFM): the multicast source node transfers data to m intra-rack and n inter-rack destination nodes at the same time, in the pure wired case.
– Inter-Rack First Multicast with Wireless (IRFMW): the multicast source node transfers data to m intra-rack and n inter-rack destination nodes at the same time, in the case of MMW enhancement.

The second is Realistic, where multicast sessions can run in parallel and other background flows exist. In all scenarios, we give inter-rack traffic higher priority than intra-rack traffic.

5.2.1 Exclusive scenario

In this scenario, we prove that the optimal parameter setting is m = n = 1. The proof is as follows.

Without loss of generality, we assume the transfer proceeds in rounds. Suppose the multicast source node transfers data to k other racks in the first round, and that the transmission time between two nodes is 1 time unit when there is no competition. As there are no background flows, a transfer to k nodes shares the bandwidth equally, so the transmission time of a round is k. When the first round completes, k + 1 nodes have the data, i.e., one source and k seeds. After the second round, (k + 1)^2 nodes have the data, and so on. Thus, if we want to multicast data to N different racks in the shortest time, we have

(k + 1)^t = N,    (1)

where t represents the total number of transfer rounds we need. Hence, to multicast data to all N nodes, the total time is

T = k · log_{k+1} N.    (2)

From Eq. 2, we can see that when the total number of multicast nodes N is constant, the larger the number of concurrent transmissions k, the longer the multicast completion time (MCCT) T.


Fig. 8 The average MCCT (ms) without background flows with a single multicast: a 1 MB, b 10 MB, c 100 MB (algorithms: CM, SM, RM, IRFM with m = n = 1…4, IRFMW)

It is trivial to extend the analysis to intra-rack transfers. In conclusion, when there are no background flows, the optimal parameters for IRFM are m = n = 1.
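A quick numeric check of Eq. (2) illustrates this; N = 80 is used purely as an example here, matching the number of hosts in the simulated topology:

```python
import math

N = 80  # example value: number of hosts in the simulated topology
for k in (1, 2, 3, 4):
    T = k * math.log(N, k + 1)  # Eq. (2): T = k * log_{k+1}(N)
    print(f"k = {k}: T = {T:.2f} time units")
# T grows with k (about 6.32, 7.98, 9.48, 10.89), so k = 1 is optimal.
```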

5.2.2 Realistic scenario

In this case, we assume that all background flows and multicast flows share the bandwidth. Since background flows and inter-rack multicast flows share the bandwidth equally, we choose the value of n as large as possible to obtain as much bandwidth as possible. We set n to the number of racks containing multicast destination nodes minus 1 (i.e., excluding the rack with the source node).
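In code, the choice of n in the realistic scenario reduces to counting the destination racks (the helper name and the rack_of mapping are illustrative assumptions):

```python
def choose_n(source, destinations, rack_of):
    """n = number of racks containing destinations, excluding the source's rack."""
    dest_racks = {rack_of[d] for d in destinations}
    dest_racks.discard(rack_of[source])
    return len(dest_racks)
```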

Fig. 9 The average MCCT (ms) without background flows with two concurrent multicasts: a 1 MB, b 10 MB, c 100 MB (same algorithms as Fig. 8)

For the intra-rack parameter m, we adopt an adaptive strategy. A source/seed monitors its uplink bandwidth utilization periodically. Once it finds idle bandwidth, which results from congested inter-rack links, the host starts one more multicast transfer to another node in the same rack.
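A minimal sketch of this adaptive strategy is given below; the polling interface and the utilization threshold are assumptions, since the paper states only the idea:

```python
IDLE_THRESHOLD = 0.9  # assumed: uplink utilization below this counts as idle

def maybe_add_intra_rack_transfer(utilization, pending_same_rack):
    """Called periodically on a source/seed with its measured uplink utilization;
    returns one extra intra-rack receiver to start, or None."""
    if utilization < IDLE_THRESHOLD and pending_same_rack:
        # Congested inter-rack links leave uplink capacity unused, so start
        # one more transfer to a destination in the same rack.
        return pending_same_rack.pop()
    return None
```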

5.3 Inter-Rack First Multicast (IRFM) evaluation

In this section, we conduct large-scale NS3 simulations to evaluate the performance of IRFM. The network topology used in the NS3 simulations is shown in Fig. 2, and the network settings are the same as in Sect. 4.2.

Several heuristic algorithms are also evaluated for comparison, as shown in Table 1. We implement all of these algorithms in NS3 simulations under the exclusive and realistic scenarios discussed in Sect. 5.2.


Fig. 10 The average MCCT (ms) with background flows, 1 MB, 1 multicast (x-axis: number of background flows; algorithms: CM, SM, RM, IRFM(n = racks − 1), IRFM with m = n = 1…4, IRFMW)

The multicast settings (e.g., the multicast source node and the multicast destination nodes) are the same for all algorithms. We compare the MCCT of the different multicast algorithms.

5.3.1 Multicast in the exclusive case

When only one host multicasts data to the other 79 hosts, Fig. 8 shows the multicast completion time (MCCT) of the different multicast algorithms when transferring 1 MB (Fig. 8a), 10 MB (Fig. 8b), and 100 MB (Fig. 8c), respectively.

As shown in Fig. 8, IRFM is superior to the other multicast algorithms: it is 3.7−11.2× faster. For IRFM, the parameters m and n are both 1, as we proved in Sect. 5.2; this setting is 1.3−1.7× faster than other parameter choices. Because the bandwidth between the racks is not the bottleneck in the exclusive case, the results under MMW enhancement (IRFMW) are the same as without it.

We also test the case where two hosts in different racks each multicast data to the other 79 hosts. The results are shown in Fig. 9. They are similar to Fig. 8, except for the average MCCT of CM. The reason is that the only bottleneck of CM is the uplink of the source node.

Fig. 11 The average MCCT (ms) with background flows, 1 MB, 2 multicasts (x-axis: number of background flows; same algorithms as Fig. 10)

Because the two multicast source hosts are in different racks, they are independent of each other, and the average CM MCCT is the same as that in Fig. 8. For the other multicast algorithms, the two sources multicast data by sharing bandwidth, so the average MCCT is doubled.

5.3.2 Multicast in the realistic case

Figure 10 shows that the proposed multicast algorithm IRFM can reduce MCCT greatly even when background flows exist. IRFM is 2.0−44.2× faster than the other algorithms when transferring 1 MB of data. In addition, MMW links between racks can further improve multicast performance because of the bottlenecks between the racks: IRFMW is 1.3× faster than IRFM and 2.7−58.6× faster than the other naive algorithms. We also test the case with two concurrent multicast transfers (Fig. 11), and we can conclude that each multicast algorithm is quite stable. By using millimeter wavelength wireless technology (MMW) in a traditional wired DC, we further improve multicast performance in the realistic case.

There is no significant difference among the choices of the parameter m for IRFM in the realistic case. The reason is that IRFM detects the bandwidth utilization of a node and adaptively changes the number of concurrent transfers. The small size of the transferred data diminishes the advantage of this mechanism.


Fig. 12 The average MCCT (ms) with background flows, 100 MB, 1 multicast (40 and 80 background flows; IRFM(n = racks − 1) and IRFM with m = n = 1…4)

Fig. 13 The average MCCT (ms) with background flows, 100 MB, 2 multicasts (40 and 80 background flows; same algorithms as Fig. 12)

Therefore, we multicast a large amount of data (100 MB) to evaluate the performance of IRFM. As shown in Figs. 12 and 13, we set the parameter n to the number of racks minus 1, as recommended in Sect. 5.2. From the above results, we can conclude that the experiments are in accordance with the theory.

6 Conclusions

In this paper, we proposed the Inter-Rack First Multicast (IRFM) algorithm for big data systems in DC networks, and we also take advantage of millimeter wavelength wireless technology (MMW) to further accelerate DC networks. Our work was motivated by multicast in DC networks, which is widespread in many applications

(e.g., HDFS [41]). We implement the multicast API in the application layer. The IRFM algorithm makes full use of the bandwidth and the topology of the data center network. Our experiments show that IRFM is 3.7−11.2× faster than naive multicast implementations in the pure wired case, and 4.8−14.6× faster in the case of MMW enhancement.

Acknowledgements The authors would like to thank the anonymous reviewers for their valuable comments. This research is supported by the National Key R&D Program of China 2018YFB1003505, the National Natural Science Foundation of China under Grant Nos. 61602194, 61772265, and 61802172, the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Jiangsu Innovation and Entrepreneurship (Shuangchuang) Program.

Compliance with ethical standards

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

1. Li, D., Cui, H., Hu, Y., Xia, Y., & Wang, X. (2011). Scalable data center multicast using multi-class bloom filter. In 2011 19th IEEE international conference on network protocols (pp. 266–275). New York: IEEE.

2. Li, D., Li, Y., Wu, J., Su, S., & Yu, J. (2012). ESM: Efficient and scalable data center multicast routing. IEEE/ACM Transactions on Networking (TON), 20(3), 944–955.

3. Li, X., & Freedman, M. J. (2013). Scaling IP multicast on datacenter topologies. In Proceedings of the 9th ACM conference on emerging networking experiments and technologies (pp. 61–72). New York: ACM.

4. Data Distribution Service. http://portals.omg.org/dds/.

5. Fan, L., Cao, P., Almeida, J., & Broder, A. Z. (2000). Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3), 281–293.

6. Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the Netflix prize. In International conference on algorithmic applications in management (pp. 337–348). Berlin: Springer.

7. Gates, A. F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S. M., Olston, C., et al. (2009). Building a high-level dataflow system on top of map-reduce: The Pig experience. Proceedings of the VLDB Endowment, 2(2), 1414–1425.

8. McBride, M., & Liu, H. (2012). Multicast in the data center overview. https://datatracker.ietf.org/doc/draft-ietf-mboned-dc-deploy/.

9. Chokkalingam, A., & Riyaz, F. (2004). BitTorrent protocol specification v1.0. CSI 5321.

10. Qiu, D., & Srikant, R. (2004). Modeling and performance analysis of BitTorrent-like peer-to-peer networks. In ACM SIGCOMM computer communication review (Vol. 34, pp. 367–378). New York: ACM.

11. Beloglazov, A., & Buyya, R. (2010). Energy efficient resource management in virtualized cloud data centers. In Proceedings of the 2010 10th IEEE/ACM international conference on cluster, cloud and grid computing (pp. 826–831). New York: IEEE Computer Society.


12. Al-Fares, M., Loukissas, A., & Vahdat, A. (2008). A scalable, commodity data center network architecture. In ACM SIGCOMM computer communication review (Vol. 38, pp. 63–74). New York: ACM.

13. Alizadeh, M., & Edsall, T. (2013). On the data path performance of leaf-spine datacenter fabrics. In 2013 IEEE 21st annual symposium on high-performance interconnects (pp. 71–74). New York: IEEE.

14. Petrini, F., & Vanneschi, M. (1997). k-ary n-trees: High performance networks for massively parallel architectures. In Proceedings 11th international parallel processing symposium (pp. 87–93). New York: IEEE.

15. Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., et al. (2009). VL2: A scalable and flexible data center network. In ACM SIGCOMM.

16. Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., et al. (2015). Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. In Proceedings of ACM SIGCOMM 2015 (pp. 183–197). New York: ACM.

17. Adhikari, P. (2008). Understanding millimeter wave wireless communication. San Diego: Loea Corporation.

18. Rappaport, T. S., Heath, R. W., Jr., Daniels, R. C., & Murdock, J. N. (2014). Millimeter wave wireless communications. London: Pearson Education.

19. Shi, J.-W., Huang, C.-B., & Pan, C.-L. (2011). Millimeter-wave photonic wireless links for very high data rate communication. NPG Asia Materials, 3(4), 41.

20. Bhattacharyya, S., Keshav, S., & Seth, A. (August 3, 2010). Opportunistic data transfer over heterogeneous wireless networks. US Patent 7,769,887.

21. Katayama, Y., Takano, K., Kohda, Y., Ohba, N., & Nakano, D. (2011). Wireless data center networking with steered-beam mmWave links. In 2011 IEEE wireless communications and networking conference (pp. 2179–2184). New York: IEEE.

22. Vardhan, H., Thomas, N., Ryu, S. R., Banerjee, B., & Prakash, R. (2010). Wireless data center with millimeter wave network. In 2010 IEEE global telecommunications conference GLOBECOM 2010 (pp. 1–6). New York: IEEE.

23. Halperin, D., Kandula, S., Padhye, J., Bahl, P., & Wetherall, D. (2011). Augmenting data center networks with multi-gigabit wireless links. In ACM SIGCOMM computer communication review (Vol. 41, pp. 38–49). New York: ACM.

24. Hamedazimi, N., Qazi, Z., Gupta, H., Sekar, V., Das, S. R., Longtin, J. P., et al. (2014). FireFly: A reconfigurable wireless data center fabric using free-space optics. In ACM SIGCOMM computer communication review (Vol. 44, pp. 319–330). New York: ACM.

25. Wang, X., Kong, L., Kong, F., Qiu, F., Xia, M., Arnon, S., et al. (2018). Millimeter wave communication: A comprehensive survey. IEEE Communications Surveys and Tutorials, 20(3), 1616–1653.

26. Zhou, X., Zhang, Z., Zhu, Y., Li, Y., Kumar, S., Vahdat, A., et al. (2012). Mirror mirror on the ceiling: Flexible wireless links for data centers. ACM SIGCOMM Computer Communication Review, 42(4), 443–454.

27. Jiang, D., Huo, L., Lv, Z., Song, H., & Qin, W. (2018). A joint multi-criteria utility-based network selection approach for vehicle-to-infrastructure networking. IEEE Transactions on Intelligent Transportation Systems, 99, 1–15.

28. Jiang, D., Li, W., & Lv, H. (2017). An energy-efficient cooperative multicast routing in multi-hop wireless networks for smart medical applications. Neurocomputing, 220, 160–169.

29. Cui, Y., Wang, H., Cheng, X., Li, D., & Ylä-Jääski, A. (2013). Dynamic scheduling for wireless data center networks. IEEE Transactions on Parallel and Distributed Systems, 24(12), 2365–2374.

30. Yi, X., Liu, F., Liu, J., & Jin, H. (2014). Building a network highway for big data: Architecture and challenges. IEEE Network, 28(4), 5–13.

31. Li, D., Yu, J., Yu, J., & Wu, J. (2011). Exploring efficient and scalable multicast routing in future data center networks. In INFOCOM, 2011 proceedings IEEE (pp. 1368–1376). New York: IEEE.

32. Chowdhury, M., Zaharia, M., Ma, J., Jordan, M. I., & Stoica, I. (2011). Managing data transfers in computer clusters with Orchestra. In ACM SIGCOMM computer communication review (Vol. 41, pp. 98–109). New York: ACM.

33. Yu, T., Noghabi, S. A., Raindel, S., Liu, H., Padhye, J., & Sekar, V. (2016). FreeFlow: High performance container networking. In Proceedings of the 15th ACM workshop on hot topics in networks (pp. 43–49). New York: ACM.

34. Jiang, D., Huo, L., & Song, H. (2018). Rethinking behaviors and activities of base stations in mobile cellular networks based on big data analysis. IEEE Transactions on Network Science and Engineering. (Early Access).

35. Jiang, D., Huo, L., & Li, Y. (2018). Fine-granularity inference and estimations to network traffic for SDN. PLoS ONE, 13(5), e0194302.

36. Nie, L., Jiang, D., & Xu, Z. (2013). A compressive sensing-based reconstruction approach to network traffic. Computers and Electrical Engineering, 39(5), 1422–1432.

37. Mao, Y., & Saul, L. K. (2004). Modeling distances in large-scale networks by matrix factorization. In Proceedings of the 4th ACM SIGCOMM conference on internet measurement (pp. 278–287). New York: ACM.

38. Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Quantitative applications in the social sciences. Beverly Hills: Sage University Papers Series.

39. Mclust. http://www.stat.washington.edu/mclust/.

40. Fraley, C., Raftery, A., Murphy, T., & Scrucca, L. (2012). mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Seattle: University of Washington.

41. HDFS. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.

Publisher’s Note Springer Nature remains neutral with regard to juris-dictional claims in published maps and institutional affiliations.


Yi Wang received the BS, MS and PhD degrees from the Department of Electronics and Information Engineering at Huazhong University of Science and Technology, China, in 2000, 2003, and 2009, respectively. She is now a lecturer in the School of Electronics Information and Communications, Huazhong University of Science and Technology, China. Her research interest is cloud computing.

Bingquan Wang received his BEng degree from the School of Electronic Information and Communications, Southeast University, Nanjing, China. He is currently completing his Master's degree at Nanjing University of Science and Technology. His research interests include peer-to-peer networks and networking.

Chen Tian is an associate professor with the State Key Laboratory for Novel Software Technology, Nanjing University, China. He was previously an associate professor with the School of Electronics Information and Communications, Huazhong University of Science and Technology, China. Dr. Tian received the BS (2000), MS (2003) and PhD (2008) degrees from the Department of Electronics and Information Engineering, Huazhong University of Science and Technology, China. From 2012 to 2013, he was a postdoctoral researcher with the Department of Computer Science, Yale University. His research interests include data center networks, network function virtualization, distributed systems, Internet streaming and urban computing.

Fu Xiao received the Ph.D. degree in Computer Science and Technology from Nanjing University of Science and Technology, Nanjing, China, in 2007. He is currently a Professor and PhD supervisor with the School of Computer, Nanjing University of Posts and Telecommunications, Nanjing, China. He has published over 30 papers in related international conferences and journals, including IEEE Journal on Selected Areas in Communications, IEEE Transactions on Networking, IEEE Transactions on Mobile Computing, INFOCOM, IPCCC, and ICC. His main research interests are Wireless Sensor Networks and the Internet of Things. Dr. Xiao is a member of the IEEE Computer Society and the Association for Computing Machinery.
