
REWIRE: An Optimization-based Framework for Unstructured Data Center Network Design

Andrew R. Curtis, Tommy Carpenter, Mustafa Elsheikh, Alejandro López-Ortiz and S. Keshav
Cheriton School of Computer Science

University of Waterloo

Abstract—Despite the many proposals for data center network (DCN) architectures, designing a DCN remains challenging. DCN design is especially difficult when expanding an existing network, because traditional DCN design places strict constraints on the topology (e.g., a fat-tree). Recent advances in routing protocols allow data center servers to fully utilize arbitrary networks, so there is no need to require restricted, regular topologies in the data center. Therefore, we propose a data center network design framework, which we call REWIRE, that designs networks using an optimization algorithm. Our algorithm finds a network with maximal bisection bandwidth and minimal end-to-end latency while meeting user-defined constraints and accurately modeling the predicted cost of the network. We evaluate REWIRE on a wide range of inputs and find that it significantly outperforms previous solutions—its network designs have up to 100–500% more bisection bandwidth and less end-to-end network latency than equivalent-cost DCNs built with best practices.

I. INTRODUCTION

Organizations have deployed a considerable amount of data center infrastructure in recent years, and much of this has been from the expansion and upgrading of existing facilities. For example, a recent survey found that nearly two-thirds of data center operators in the U.S. have added data center capacity in the past 12–24 months and 36% plan on adding capacity in 2011 [17]. Most existing DCNs are small- to medium-sized.¹ These data centers generally use a vendor-defined network architecture, such as Cisco's [9], which arranges the topology as a 1+1 redundant tree. Doing so results in under-provisioned networks; for example, Microsoft researchers [18] found links as much as 1:240 oversubscribed in their data centers. Such heavily oversubscribed networks constrain server utilization because they limit agility—the ability to assign any server to any service—and they lack sufficient bisection bandwidth for modern distributed applications such as MapReduce, Dryad [26], partition-aggregate applications [3], and scientific computing.

Despite this, operators have little guidance when planning and executing a data center expansion or upgrade. Designing a new or updated network is a challenging optimization problem that needs to optimize multiple objectives while meeting many constraints. Most physical data center designs are unique, so expansions and upgrades must be custom designed for each data center (see, e.g., industry white papers [41]).

¹75% of survey respondents in the US have data centers with power load of less than 2.0 MW [17], which implies they run fewer than 4000 servers if each server draws 500 W (including the power to cool it).

The optimization challenge is to maximize network performance (which includes bisection bandwidth, end-to-end latency and reliability) while minimizing costs and satisfying a large number of constraints.

We propose REWIRE, a framework to design new, upgraded and expanded DCNs. Unlike previous solutions, REWIRE does not place restrictions on the space of topologies considered. Instead, it maximizes bisection bandwidth and minimizes end-to-end latencies of its network designs by searching the space of all networks that are feasible under a user-specified data center model. We find that arbitrary DCN topologies have significant performance benefits compared to previously considered regular topologies. When designing greenfield (i.e., new) data center networks, REWIRE's networks have at least 500% more bisection bandwidth than a fat-tree constructed with the same budget. When upgrading or expanding an existing network, REWIRE achieves similar performance gains and beats a greedy algorithm as well. Arbitrary DCN topologies do create operational and management concerns, since most DCN architectures are topology-dependent; however, recent proposals support routing and load-balancing on arbitrary DCNs, so this is not a major barrier to adoption. We discuss this further in Sec. V.

REWIRE's network design algorithm uses local search, so it evaluates many candidate network designs. Computing the performance of a network involves determining its bisection bandwidth, and we are not aware of any previous polynomial-time algorithm to compute the bisection bandwidth of an arbitrary network; however, we show that it can be found by solving a linear program (LP) given by Lakshman et al. to find an optimal oblivious, or static, routing. Unfortunately, this LP has O(n^4) variables and constraints, where n is the number of switches in the network, so it is expensive to solve even for small networks. To speed this process, we implement a (1+ε)-approximation algorithm to compute this LP [30]. We further speed up this approximation algorithm by implementing its bottleneck operation—an all-pairs shortest-path computation—on the GPU using NVIDIA's CUDA framework [34]. Our implementation is 2–23x faster than a high-performance CPU implementation. Additionally, we utilize a heuristic based on the spectral gap of a graph, which is the difference between the largest two eigenvalues of a graph's adjacency matrix. We find that the spectral gap of a graph is a useful heuristic for candidate selection, especially when designing greenfield (newly constructed) DCNs.

The closest related work to REWIRE is LEGUP [13], another framework to design DCN upgrades and expansions. However, the LEGUP approach imposes a heterogeneous Clos topology on the updated network, and also requires a Clos network as input. We seek a general solution—one that accepts any network as input and returns arbitrary topologies.

II. DEFINING THE PROBLEM

Designing a data center network is a major undertaking. The solution space is massive due to the huge number of variables, and an ideal network maximizes many objectives simultaneously. Our goal is to automate the task of designing the best network possible given a user's budget and a model of their data center. Ideally, a user of our system need only hire staff to wire the DCN according to the system's output. For the remainder of this section, we describe the data center environment and state our assumptions.

A. Network design and assumptions

We now describe DCN workloads and their impact on topology design.

Workload assumptions: From available DCN measurement studies [4], [5], [18], [27], we know that DCN workloads exhibit a high degree of variance, and therefore must be provisioned accordingly. In a 24-hour period, there can be an order of magnitude difference between the peak and minimum load in the DCN [23]. The network needs enough capacity to account for this daily fluctuation, and it should account for future demand. DCN traffic is also unpredictable over short periods. Studies indicate that DCNs exhibit highly variable traffic [4], [27]; that is, the traffic matrix (TM) in a DCN shifts frequently and its overall volume (i.e., the sum of its entries) changes dramatically in short periods.

Because of the variability in DCN traffic, we assume that an ideal DCN can feasibly route all hose traffic matrices, which are the traffic matrices where a node i sends or receives at most r(i) traffic at once. Typically r(i) is equal to the sum of the NIC uplink rates at i. We denote the polyhedron of hose TMs valid for nodes 1, . . . , n by T(r(1), . . . , r(n)). The switching fabric is never a bottleneck in a network that can feasibly route all hose TMs—instead, flows are limited only by contention for bandwidth at end-host NICs.
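For concreteness, the hose polytope can be written as follows (our notation; D_{ij} denotes the rate from node i to node j):

```latex
T(r(1),\dots,r(n)) = \Big\{ D \in \mathbb{R}^{n \times n}_{\ge 0} :
    \sum_{j} D_{ij} \le r(i) \ \text{and}\ \sum_{j} D_{ji} \le r(i)
    \quad \text{for all } i \Big\}
```

Any TM whose row and column sums respect the node rates lies in the polytope, so a network provisioned for T(·) needs no knowledge of which specific TM will occur.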

Topology design: Greenfield data centers are typically built with a highly regular topology. Most DCNs, especially small- to medium-sized DCNs, are built following vendor guidelines, which generally arrange the network as a 3-level, 1+1 redundant tree topology. In this topology, each rack contains 20–80 servers that connect to a top-of-rack (ToR) switch. A typical ToR switch has 48 1 Gbps ports and up to four 10 Gbps ports. These ToR switches are the leaves of the multi-rooted switching tree. This tree usually has three levels: the ToR switches connect to a level of aggregation switches, which connect to a core level made up of either switches or routers. The core switches are typically connected to a pair of edge routers which connect the data center to the internet.

Theoretical topology constructions date back to telephone switching networks, when the goal was to interconnect circuits. A variety of topology constructions have been proposed over the years to interconnect hundreds of thousands of servers, for example, [1], [10], [13], [20]. The general theme of work in this area is to scale out; that is, these topologies use multiple commodity switches in place of a single high-end switch.

Cost-effective topology design is important to data center operation, because providing enough bisection bandwidth is crucial to reducing the overall cost of operating a data center: it allows for higher server utilization. Servers account for 45–60% of the cost of building a large-scale data center [19], and server utilization is generally 10–20%, with an average utilization of 30% being very good [21]. Networks with too little bisection bandwidth cannot allocate servers to jobs quickly enough to improve utilization further [18].

B. Switches, links and end-hosts

Most data centers today run on commodity off-the-shelf (COTS) servers and switches. Using COTS equipment reduces costs, so we assume the network will be composed of COTS switches. There are many such switches to choose from, each with different features. As such, we allow users to define the specifications of the switches available to add to their data center. For example, we enable the user to input details about line cards available for a particular switch type. Also, users can specify a processing delay for each switch, which indicates the amount of time it takes the switch to forward a packet when it has no other load.

Links can be optical or copper and can use incompatible connector types. Currently, we do not model the difference in link medium or connectors; instead, we assume that all 1 Gb ports use the same connector, that all 10 Gb ports use the same connector, and that a 10 Gb port can run at 1 Gb. Copper links also pose a problem because they are limited to runs of less than 5–10 m when operating at 10 Gbps [33]. It would not be difficult to check for link medium and connector types; however, we do not do so currently.

We assume that the data center operator has full control over end-hosts. That is, they can install custom multipath routing protocols, like Multipath TCP [37] and SPAIN [32], on all end-hosts. This assumption does not hold in cloud settings, where customers can run their own OS installations. Here, the cloud provider could release a set of OS images that have the required end-host modifications installed.

C. DCN performance: what’s important to applications?

The two most important characteristics of DCN performance are high bisection bandwidth and low latency. Bisection bandwidth is especially important, because when the network has enough bisection bandwidth, all servers can fully utilize their NICs no matter where they are sending data to or what other servers on the network are doing. This is ideal—end-hosts are the limiting factor in such a network, not the switching fabric. Spare bisection bandwidth also keeps end-to-end latencies across the network low, because links will be lightly loaded, and hence queuing delays will be minimal. Minimal latency is important for interactive data center jobs, such as search and other partition-aggregate jobs, where up to hundreds of worker servers perform jobs for a service. The results of these jobs are aggregated by a master server. Latency is critical for this type of service because responses from workers that do not respond by a deadline (usually 10–100 ms) are not included in the final results, lowering result quality and potentially reducing revenue [3].

D. Cost model

Estimating the cost of building a DCN design is very difficult. There are two major obstacles to accurately pricing a DCN design: (1) it is hard to get prices of switches—street prices are often a fraction of retail prices and can be difficult to obtain, and vendors offer discounts for bulk purchases, so the price of a switch often decreases as more of its type are purchased; and (2) it is difficult to estimate the cost of cabling a DCN. Cables that are too long incur installation costs because the contractor has to hide cabling. Bundling cables together in long runs also reduces the cost of wiring long-distance links [33], [36], though we are not aware of any algorithms to compute the cost of such a wiring. Additionally, it may be more expensive to wire irregular topologies compared to regular topologies; however, we do not have any data on this, so we bill per link, regardless of the topology structure.

For tractability reasons, we assume fixed prices for switches. That is, the switch prices specified by the user should be an estimate of how much the switch will cost even if just a single switch of its type is purchased. For cabling, we divide cable lengths into different categories (short, medium and long) and charge for a cable based on its length category.

III. REWIRE ALGORITHM

We now describe the operation of REWIRE, starting by formally stating the problem it solves. The REWIRE algorithm performs local search, so it starts with a candidate solution, which is a network design that does not violate any constraints. It explores the search space of all candidate solutions by moving from one candidate to another by modifying local properties of the solution until a near-optimal solution is found. This local search only optimizes the network's wiring—it does not add switches to the network. Therefore, we end this section by describing how to extend our approach to add new switches to the network as well.

A. Optimization problem formulation

REWIRE's goal is to find a network with maximal performance, subject to numerous user-specified constraints.

1) Optimization objective: REWIRE designs networks to jointly maximize bisection bandwidth and minimize the worst-case latency between ToR switches; that is, given fixed scalars α and β, our objective function is:

maximize α · bw(G) − β · latency(G)

where bw(G) and latency(G) are defined as follows:

• Bisection bandwidth: denoted bw(G), depends on the rate r(i) of a node i, which we define as the peak amount of traffic i can initiate or receive at once. For example, a server v with a 1 Gbps NIC has r(v) = 1 Gbps. For simplification, we aggregate the bandwidth from all servers attached to a ToR switch at that switch; that is, we let the rate r(i) of a ToR switch i be the sum of the rates of the servers directly connected to the switch (e.g., a ToR switch connected to 40 servers, each with a 1 Gbps NIC, has a rate of 40 Gbps). The bisection bandwidth of a network G is then (a brute-force sketch of this definition appears after this list):

  bw(G) = \min_{\emptyset \neq S \subsetneq V} \frac{\sum_{e \in \delta(S)} w(e)}{\min\left\{\sum_{v \in S} r(v), \; \sum_{v \in \bar{S}} r(v)\right\}}

where δ(S) is the set of edges with one endpoint in S and the other in S̄ = V − S.

• Worst-case latency: is defined as the worst-case shortest-path latency between any pair of ToR switches in the network, where the latency of a path is the sum of queuing, processing, and transmission delays of switches on the path. We assume that the queuing delay at any port in the network is constant because we have no knowledge of network congestion while designing the network.
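As a concrete (if impractical) illustration of the bisection bandwidth definition above, the following sketch enumerates every cut directly. The names and graph encoding are ours, and REWIRE itself avoids this exponential enumeration by solving the two-phase-routing LP of Sec. III-B1:

```python
from itertools import combinations

def bisection_bandwidth(nodes, edges, rate):
    """Brute-force bw(G): minimize cut capacity over the smaller side's
    aggregate rate, across all non-trivial cuts (S, S-bar). Exponential
    in |V|; for illustration on tiny graphs only.
    nodes: list of node ids; edges: {(u, v): capacity}; rate: {node: r}."""
    total_rate = sum(rate.values())
    best = float("inf")
    for k in range(1, len(nodes)):
        for subset in combinations(nodes, k):
            S = set(subset)
            cut = sum(c for (u, v), c in edges.items() if (u in S) != (v in S))
            rate_S = sum(rate[v] for v in S)
            best = min(best, cut / min(rate_S, total_rate - rate_S))
    return best

# Example: 4 ToR switches in a ring of 10 Gbps links, each with rate 40 Gbps.
ring = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (3, 0): 10}
print(bisection_bandwidth([0, 1, 2, 3], ring, {v: 40 for v in range(4)}))
# -> 0.25: a balanced cut is crossed by 20 Gbps against 80 Gbps of node rate
```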

Both of these metrics have been considered in the design of DCN topologies, e.g., [18], [33]. However, as far as we know, no previous algorithm could compute the bisection bandwidth of an arbitrary network in polynomial time. Therefore, we propose such an algorithm by combining previous theoretical results in Sec. III-B1.

2) User-specified constraints: REWIRE incorporates a wide range of constraints into its optimization procedure. These are provided by the user, and are:

• Budget. This is the maximum amount of money the user is willing to spend on the network.

• Existing network topology and specifications. To perform an upgrade or expansion, we need the existing topology. If designing a greenfield network, then REWIRE needs a set of ToR switches given as the existing network, because our current design does not attach servers to ToR switches. This input needs to include specifications for all network devices. For switches this includes their neighbors in the network and relevant details such as the number of free ports of each link rate. Our implementation does not support different link types (e.g., copper vs. optical links); however, it would be easy to extend it so that links include the type of connectors on the ends.

• Link prices and specifications. We need a price estimate for labor and parts for each link length category.

• Available switch prices and specifications (optional). If one would like to add new switches to the network, REWIRE needs as input the prices and specifications of a few switches available on the market. Specifications include the number and speeds of ports, peak power consumption, thermal output and the number of rack slots the switch occupies.

• Data center model (optional). Consists of the following:


– Physical layout of racks;
– Description of each rack's contents (e.g., switches, servers, PDUs, number of free slots);
– Per-rack heat constraints; and/or
– Per-rack power constraints.

The data center model places constraints on individual racks. We use these constraints to, for example, restrict the placement of new switches to racks with enough free slots.

• Reliability requirements (optional). This places a constraint on the number of links in the min-cut of the network, that is, the minimum number of link removals necessary to partition the network into two connected components.

B. Local Search Approach

REWIRE uses simulated annealing (SA) [29] to search through candidate solutions. Due to space constraints, we do not give full details of our approach here; see the full version of this paper for details [12]. Instead, we discuss the main components: computing the performance of a solution, selecting candidate solutions to evaluate, and designing networks with new switches.

1) Evaluating a candidate solution: Our definition of performance is the weighted sum of bisection bandwidth and latency, so we now describe how to compute each metric.

Bisection bandwidth: Recall that bisection bandwidth is defined by a minimum over all cuts of a graph. This is too expensive to compute directly on arbitrary graphs because a graph can have exponentially many cuts. However, we can compute the bisection bandwidth of an arbitrary graph by computing a two-phase routing for the network.

Two-phase routing, proposed by Lakshman et al. [30], is oblivious, meaning that it finds a static routing that minimizes the maximum link utilization for any traffic matrix in a polyhedron of traffic matrices. Two-phase routing divides routing into two phases. During phase one, each node forwards an αi fraction of its ingress traffic to node i. During phase two, nodes forward the traffic they received during phase one on to its final destination. We say that α1, . . . , αn are the load-balancing parameters of a graph G = (V,E). The optimal αi values depend on G and the set of hose TMs T(r(1), . . . , r(n)) for V = {1, . . . , n}.
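The key property of this scheme is that the per-phase demands depend only on each node's total ingress and egress rates, never on the particular TM. The sketch below (our illustration; the LP of [30] is what actually chooses the α values) makes the decomposition explicit:

```python
def two_phase_demands(alpha, D):
    """Decompose a traffic matrix D (D[s][t] = rate from s to t) under
    two-phase routing: in phase 1, s forwards an alpha[i] fraction of its
    traffic to intermediate i; in phase 2, i relays it to the destination.
    Returns (phase1, phase2) demand matrices. Both depend on D only
    through its row/column sums, so a single static routing serves every
    hose TM with those rates."""
    n = len(alpha)
    egress = [sum(D[s]) for s in range(n)]                        # traffic s originates
    ingress = [sum(D[s][t] for s in range(n)) for t in range(n)]  # traffic bound for t
    phase1 = [[alpha[i] * egress[s] for i in range(n)] for s in range(n)]
    phase2 = [[alpha[i] * ingress[t] for t in range(n)] for i in range(n)]
    return phase1, phase2
```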

Before describing how to compute a two-phase routing, we show that finding the load-balancing parameters of a network gives us the network's bisection bandwidth. We denote a cut of G by (S, S̄), where S and S̄ are connected components of G and S̄ = V − S. Let c(S, S̄) be the capacity of all edges with one endpoint in S and the other in S̄. The following theorem shows how to compute the bisection bandwidth of a graph given α1, . . . , αn.

Theorem 1 ([15]): A network G = (V,E) with node rates r(1), . . . , r(n) and load-balancing parameters α1, . . . , αn can feasibly route all hose TMs T(r(1), . . . , r(n)) using multipath VLB routing if and only if, for all cuts (S, S̄) of G,

  c(S, \bar{S}) \;\ge\; \sum_{i \in \bar{S}} \alpha_i \cdot \sum_{j \in S} r(j) \;+\; \sum_{i \in S} \alpha_i \cdot \sum_{j \in \bar{S}} r(j)

where S̄ = V − S.

This theorem shows that a multi-commodity flow version of the famous max-flow, min-cut theorem holds for networks using two-phase routing under the hose model. It relates bisection bandwidth to the hose traffic model; i.e., if a network can feasibly route all hose TMs, then it has full bisection bandwidth.

Optimal values for α1, . . . , αn can be found using a linear program (LP) due to Lakshman et al. [30]. Full details of their LP can be found in the full version of this paper [12].

Their LP can be solved in polynomial time using an LP solver; however, it is computationally expensive because it has O(n^4) constraints and O(n^4) variables. In our initial testing, we found that computing this LP for a network with 200 nodes and 400 (directed) edges needs more than 22 GB of memory using IBM's CPLEX solver [25]. Even on 50-node networks, this LP takes up to several seconds to compute. REWIRE's local search approach needs to evaluate thousands of candidate solutions; this LP is not fast enough for our purposes.

To solve these issues, we implemented an approximation algorithm by Lakshman et al. [30] to compute α1, . . . , αn in polynomial time. This algorithm finds a solution guaranteed to be within a (1 + ε) factor of optimal. The algorithm works by augmenting node i's value αi iteratively. At each iteration, the algorithm computes a weight w(e) for each edge e ∈ E and then pushes flow to i along the shortest paths to i based on these weights. The bottleneck operation in this algorithm is computing the shortest path from each node to each other node given the weights w(e); that is, it needs to perform an all-pairs shortest-path (APSP) computation. The best running times we are aware of for an APSP algorithm are O(n^3) (deterministic) [11] and O(n^2) (probabilistic) [35]. Because this operation is the bottleneck, we implemented an APSP solver on a graphics processing unit (GPU). Used this way, the GPU is a powerful, inexpensive co-processor with hundreds of cores.

We implemented a recursive version of the Floyd-Warshall algorithm for our APSP function using CUDA [6], [34]. The algorithm uses generalized matrix multiplication (GEMM) as an underlying primitive. GEMM exhibits a high degree of data parallelism, and we can achieve significant speedups by exploiting this attribute. Finally, we perform a parallel reduction to find the maximal path length in the distance matrix. Parallel reduction is an efficient algorithm for computing associative operators in parallel. It uses Θ(n) threads to compute a tree of partial results in parallel. The number of steps is bounded by the depth of the tree, which is Θ(log n) [22], [24].
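The GEMM structure is easiest to see in a min-plus formulation of APSP. The NumPy sketch below (our illustration of the idea, not the CUDA kernel of [6], [34]) repeatedly "squares" the distance matrix with (min, +) in place of (+, ×); each squaring has the same data-parallel shape as a matrix multiply, which is what the GPU exploits:

```python
import numpy as np

def apsp_min_plus(w):
    """APSP by repeated min-plus squaring. w[i][j] is the edge weight
    (np.inf if no edge, 0 on the diagonal). After ceil(log2(n-1))
    squarings, d[i][j] is the shortest-path distance. O(n^3 log n) work;
    each step is a dense, regular kernel (and, in this naive form,
    materializes an n^3 temporary, so it only suits small n)."""
    d = np.array(w, dtype=float)
    n = d.shape[0]
    steps = 1
    while steps < n - 1:
        # min-plus product: d'[i][j] = min_k d[i][k] + d[k][j]
        d = np.min(d[:, None, :] + d.T[None, :, :], axis=2)
        steps *= 2
    return d
```

The final "maximal path" step is then a max-reduction over the finite entries of d (e.g., d[np.isfinite(d)].max()), which parallelizes as a Θ(log n)-depth tree.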

Latency: We find latency(G) by computing APSP on the network, where each edge weight represents the expected time to forward packets on that link. To estimate this, we set the weight w(i, j) of an edge (i, j) to be the sum of queuing delays, forwarding time and processing delay at (i, j)'s endpoints. The processing delay of each switch is specified by the user. We assume that the forwarding time is 1500 bytes divided by the link rate (1 or 10 Gbps) and that the queuing delays are constant for all ports. Our assumption of uniform queuing delays is not realistic but necessary; since we assume no knowledge of the network load, we cannot accurately determine queuing delays.

Therefore, we compute the diameter of a candidate solution by solving APSP with these link weights. As discussed above, this is a computationally expensive process, so we use our GPU-based APSP solver for this computation.
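A sketch of the edge weights this APSP runs on, under the stated assumptions (constant queuing delay, a 1500-byte frame for forwarding time); the function name and the queue_us value are ours, since the paper only assumes the queuing delay is constant, not what it is:

```python
def edge_weight_us(link_gbps, proc_delay_us, queue_us=1.0):
    """Latency weight of an edge, in microseconds: assumed-constant
    queuing delay, plus serialization of a 1500-byte frame at the link
    rate, plus the user-specified switch processing delay."""
    forwarding_us = 1500 * 8 / (link_gbps * 1000)  # 1 Gbps = 1000 bits/us
    return queue_us + forwarding_us + proc_delay_us

# e.g., a 1 Gbps hop serializes a frame in 12 us; a 10 Gbps hop in 1.2 us.
```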

2) Candidate selection: Due to the large search space, finding optimized solutions with simulated annealing takes an infeasible amount of time for large networks, especially when there is significant room for improvement in the network. Therefore, we added the ability to seed REWIRE's simulated annealing procedure with a candidate solution. To find a seed candidate, we use a heuristic based on the spectral gap of a graph.

Due to space constraints, we omit details about a graph's spectral gap here; see the full version of this paper [12] or any textbook [8] for details. Intuitively, a graph with a "large" spectral gap will be regular, and the lengths of all shortest paths between node pairs are expected to be similar.
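Concretely, the spectral gap can be computed in a few lines (a sketch for an unweighted, undirected graph with at least two nodes; numpy's eigvalsh returns eigenvalues in ascending order):

```python
import numpy as np

def spectral_gap(adj):
    """Difference between the two largest eigenvalues of the symmetric
    0/1 adjacency matrix: the quantity maximized in spectral gap mode."""
    eig = np.linalg.eigvalsh(np.asarray(adj, dtype=float))
    return float(eig[-1] - eig[-2])
```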

We therefore modified REWIRE to optionally perform a two-stage simulated annealing procedure. In stage 1, its objective function is to maximize the spectral gap. The result from stage 1 is used to seed stage 2, where the objective function is to maximize bisection bandwidth and minimize latency. This way, stage 2 starts with a good solution and can converge more quickly. When this two-stage procedure is used, we say REWIRE is in hotstart mode.

Note that the network with the maximal spectral gap among all candidate solutions does not necessarily have high bisection bandwidth. The spectral gap metric does not take the hose constraints into account, so it does not directly optimize for bisection bandwidth. Instead, it creates networks that are well connected, which tend to have high bisection bandwidth (see [15] for details), but that is not necessarily the case, especially for heterogeneous networks.

C. Adding switches to the network

REWIRE's simulated annealing procedure does not consider adding new switches to the network—it only optimizes the wiring of a given set of switches. To find network designs with new switches, we run REWIRE on the input network plus a set of new switches that are not attached to any other switch. REWIRE attaches the new switches to the existing network randomly, and then begins its simulated annealing procedure.

While simple, this approach does not scale well. If a user has input specifications for k new switch types, then we need to run REWIRE k! times to consider all possible combinations of switch types. We believe this could be improved by applying heuristics to select a set of new switches; however, we leave investigation of such heuristics to future work.

IV. EVALUATION

We now present our evaluation of REWIRE. First, we describe the inputs used in our experiments, then the approaches used for comparison with REWIRE. Finally, we present our results using REWIRE to design greenfield, upgraded and expanded networks.

A. Inputs

1) Existing networks: To evaluate REWIRE's ability to design upgrades and expansions of existing networks, we use the University of Waterloo's School of Computer Science machine room network (denoted the SCS network) as input. The SCS network has 19 ToR, 2 aggregation and 2 core routers. Each ToR connects to a single aggregation switch with a 1 or 10 Gbps link, and both aggregation switches connect to both core switches with 10 Gbps links. The network is composed of a heterogeneous set of switches—full details of the network are in [12].

To predict the cost of a network design, REWIRE needs the distance between each ToR switch pair. We do not have this data for the SCS network. Therefore, we label each switch with a unique label from 1, . . . , n; the distance between switches i and j is then |i − j|. The distance from i to the nearest 25% of switches is categorized as "short", the distance to the next 50% as "medium", and the distance to the final 25% as "long". We use these distance categories to estimate the price of adding a link between two switches.
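One reading of this rule as code (our interpretation of the quartile cutoffs; REWIRE's exact thresholds may differ):

```python
def length_category(i, j, n):
    """Cable-length proxy for the SCS input: switches carry labels 1..n,
    distance is |i - j|, and distances in the nearest quarter of the
    label range are "short", the next half "medium", the rest "long"."""
    d = abs(i - j)
    if d <= 0.25 * (n - 1):
        return "short"
    if d <= 0.75 * (n - 1):
        return "medium"
    return "long"
```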

2) Switches and cabling: We separate the costs of adding a cable into the cost of the cable itself and the cost to install it. Mudigonda et al. report that list prices for 10 Gb cables are between $45–95 for 1 m cables and $100–400 for 10 m cables, depending on the type of cable (copper or optical and its connector types) [33]. Optical cables are more expensive than copper cables, but they are available in longer lengths. To obtain a reasonable estimate of cabling costs without creating too much complexity, we divide cable runs into three groups: short, medium and long lengths. The costs we use are shown in Table I. We also charge an installation fee for each length group (also shown in the table). Whenever an existing cable is moved, we charge the appropriate installation fee given the cable's length.

Rate                    Short ($)   Medium ($)   Long ($)

Cable costs:
  1 Gbps                     5          10           20
  10 Gbps                   50         100          200

Installation and re-wiring costs (all rates):
                            10          20           50

TABLE I: Prices of cables and the cost to install or move cables.

Table II shows the prices we assume for the available switches.

Ports                     Watts   Price ($)
24 1 Gbps                   100         250
48 1 Gbps                   150       1,500
48 1 Gbps, 4 10 Gbps        235       5,000
24 10 Gbps                  300       6,000
48 10 Gbps                  600      10,000

TABLE II: Switches used as input in our evaluation (prices taken from [13]). Prices are representative of street prices, and power draw estimates are based on a typical switch of the type according to manufacturers' estimates.


B. Comparison approaches

We compare REWIRE against the following DCN design solutions.

Fat-tree: The fat-tree, proposed by Leiserson [31], is a k-ary multi-rooted tree and a specific form of Clos's topology construction [10]. We assume a 3-level fat-tree topology and that all switches in the network must be homogeneous. Building an optimal fat-tree for a set of servers given switch specifications is NP-hard [33], so we upper bound the performance that a fat-tree network with a specified budget could achieve. To do this, we compute the number of ports the fat-tree needs, and bound the cost of switches by the min-cost port of a given rate (e.g., from Table II, a 1 Gb port costs at least $250/24 ≈ $10.42 and a 10 Gb port costs at least $10K/48 ≈ $208). To estimate the cost of cabling, we assume that server-to-ToR links are free, that ToR-to-aggregation links are medium length, and that aggregation-to-core links are long length.

Greedy algorithm: We implemented a greedy heuristic to determine whether REWIRE's more sophisticated local search approach is necessary. The algorithm iterates over all pairs of switches as follows. First, it computes the change in bisection bandwidth and latency that would result from adding a 1 Gbps and a 10 Gbps link between every pair of switches and stores the result. At the end of each iteration, the algorithm adds the link that increases the network's performance the most. If no link changes the agility or latency during an iteration, then a random link is added. This iteration continues until the budget is exhausted or no links can be added because all ports are full. Note that this algorithm does not rewire the initial input—it only adds links to the network until the budget is exhausted. This algorithm performs O(n^2) bisection bandwidth computations at each iteration, and hence does not scale to graphs with more than ∼40 nodes. Therefore, we do not compare against the greedy algorithm for any network with more than 40 nodes. (A sketch of this loop follows.)
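A sketch of the greedy loop under our own encoding: the edge set, the candidate-link set (with port availability already folded in), and the perf and link_cost callables are all assumptions, and the real implementation distinguishes 1 and 10 Gbps candidates per pair:

```python
import random

def greedy_upgrade(edges, candidate_links, budget, link_cost, perf):
    """Each iteration scores every affordable candidate link by the gain
    in perf() from adding it, then permanently adds the best one, or a
    random affordable link if none improves the objective, until the
    budget is spent or no candidates remain. Never moves or removes a
    link. edges and candidate_links are sets of link descriptors;
    perf(edges) is the weighted bisection-bandwidth/latency score."""
    edges = set(edges)
    while True:
        afford = [l for l in candidate_links - edges
                  if link_cost(l) <= budget]
        if not afford:
            return edges
        base = perf(edges)
        gain, best = max(((perf(edges | {l}) - base, l) for l in afford),
                         key=lambda t: t[0])
        if gain <= 0:
            best = random.choice(afford)  # no link helps: add one at random
        edges.add(best)
        budget -= link_cost(best)
```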

LEGUP: LEGUP is another optimization-based framework for DCN design [13]; however, LEGUP imposes structure on the network. It constructs a more general form of a Clos network that allows the use of heterogeneous switches (both in terms of link rates and port counts), and this structure is considerably more flexible than a fat-tree or traditional Clos network. For details of LEGUP's operation, see [13].

Random graph: Singla et al. proposed a DCN architecture based on random graph topologies [38]. Random graphs have nice connectivity properties, and they showed that it is less expensive to build a random graph than a fat-tree much of the time. To estimate the performance a random graph can achieve with a specified budget, we determine the expected radix of each ToR switch given the number of links one can install with the budget. Then, we compute the expected bisection bandwidth and diameter of the network following Singla et al.'s approach. Note that we use the expected bisection bandwidth and diameter, rather than explicitly constructing these networks.

[Figure omitted from this transcript. It plots bisection bandwidth (y-axis, 0–0.4) for fat-tree, random graph, LEGUP and REWIRE designs at budgets of $125, $250, $500 and $1000 per rack, with a row beneath the bars reporting each network's diameter.]

Fig. 1: Results of designing greenfield networks using a fat-tree, random graph and REWIRE for two ToR switch types. The results on the left used ToR switches with 48 1 Gbps ports and the results on the right used ToR switches with 48 1 Gbps ports and 4 10 Gbps ports. Missing bars for the random graph indicate that the network is expected to be disconnected. A network with bisection bandwidth 1 has full bisection bandwidth, and a network with diameter 1 is fully connected.

C. REWIRE settings

REWIRE can operate in several different modes. These are:

• Spectral gap mode: sets REWIRE to maximize the spectral gap of the solution, as described in Sec. III-B2.
• CPLEX or approximation: sets the method REWIRE uses to compute the bisection bandwidth of a network. In CPLEX mode, REWIRE uses IBM's CPLEX solver [25] to compute the bisection bandwidth exactly, whereas in approximation mode, REWIRE finds the bisection bandwidth of a candidate solution using the FPTAS described in Sec. III-B1.
• Hotstart: this mode first finds a candidate solution in spectral gap mode, which is used as a seed solution to stage 2, where the objective function is changed to our normal definition of performance. See Sec. III-B2 for full details.

We note what mode REWIRE was in for each experiment.

Finally, REWIRE uses a local search algorithm, so it may not always find an optimal answer, especially if it is not given enough time to run. For all experiments in this paper, we let REWIRE run for 72 hours. This was typically not enough time for REWIRE to converge to an optimal answer, so we make no claim that our results here are the best that REWIRE can obtain.

D. Greenfield networks

We begin by evaluating the effectiveness of REWIRE at designing greenfield, i.e., new, DCNs. For this experiment, the input to REWIRE is a set of ToR switches, and we use REWIRE in approximation mode. Initially, the ToR switches are each attached to servers, but no other switches. The total cost of the network is the cost of these ToR switches plus the wiring budget. We experimented with two types of ToR switches. First, we set all ToR switches to have 48 1 Gbps ports, where 24 ports attach to servers and the other 24 are left open. Then, we set all ToR switches to have 48 1 Gbps ports and 4 10 Gbps ports; each ToR switch attaches to 40 servers with 1 Gbps ports. For both experiments, we built networks to connect 3200 servers.

We compare against the fat-tree, LEGUP, and random topology approaches. The results are shown in Figure 1.


[Figure omitted from this transcript. It plots bisection bandwidth (y-axis, 0–0.5) of the input network and of fat-tree, greedy and REWIRE upgrades at budgets of $1250, $2500, $5000 and $10,000, with a row reporting each network's diameter.]

Fig. 2: Results of upgrading the SCS topology with different budgets and algorithms.

In the chart, the bars show normalized bisection bandwidth (higher is better, and a network with full bisection bandwidth has a normalized bisection bandwidth of 1). The table below the bars indicates the diameter (the worst-case hop count between ToR switches) of the network. We do not compare against the greedy heuristic for these experiments because it is not fast enough for networks with more than 40 nodes.

REWIRE significantly outperforms the other approaches for nearly all budgets. The random network has more bisection bandwidth than REWIRE's network when the budget is $5,000; however, REWIRE's solution has less latency (this network has a diameter one hop less than the expected random network's). We weighted bisection bandwidth and latency equally in this experiment, so REWIRE preferred the solution with less bandwidth, but also less latency. This illustrates the flexibility of REWIRE—one can tune its parameters depending on one's needs.

We also re-ran the REWIRE experiments using its spectral gap mode. We found that the solutions with a maximal spectral gap had the same performance as the solutions found by REWIRE in approximation mode. This implies that the spectral gap is a good metric when designing greenfield data centers, because it seems to maximize bisection bandwidth and it finds networks with very regular topologies, which could possibly reduce the cost of wiring the network.

E. Upgrading

We now evaluate REWIRE's ability to find upgrades to existing DCNs. To begin, we compared REWIRE to a fat-tree and our greedy algorithm on the SCS network for several budgets. The results are shown in Figure 2.

REWIRE significantly outperforms the fat-tree—its networks have 120–530% more bisection bandwidth than a fat-tree constructed with the same budget, and with budgets over $5K, REWIRE's network also has a shorter diameter (by 1 hop) than a fat-tree. REWIRE also outperforms the greedy algorithm for all budgets, though the greedy algorithm performs nearly as well when the budget is $5K or more. This indicates that a greedy approach performs very well in some settings; however, we have generally observed that the greedy algorithm does not perform well when it has a small budget or the input is very constrained and has few open ports.

[Figure omitted from this transcript. It plots bisection bandwidth (y-axis, 0–0.5) of the input network and of upgrades found in spectral gap, approximation + hotstart, CPLEX + hotstart, approximation, and CPLEX modes at budgets of $2500 and $10,000, with a row reporting each network's diameter.]

Fig. 3: Results of upgrading the SCS topology with different REWIRE modes and two budgets.

[Figure omitted from this transcript. It plots bisection bandwidth (y-axis, 0–0.05) of the original network and of fat-tree (1 Gb), greedy, LEGUP and REWIRE designs as 0, 160, 320, 480 and 640 servers are cumulatively added, with a row reporting each network's diameter.]

Fig. 4: Results of iteratively expanding the SCS data center.

As an example, when the budget is $2,500, REWIRE's network has 350% more bisection bandwidth than the network found by the greedy algorithm.

Next, we compared the various modes of REWIRE for two budgets, as shown in Figure 3. We observe that the spectral gap and hotstart modes perform poorly when the budget is $2,500. This is likely due to the properties of the spectral gap, which tries to make the network more regular. Because the budget is not large enough to re-wire the network in this regular fashion, optimizing the network's spectral gap creates a candidate solution with poor bisection bandwidth. This problem does not arise when the budget is large enough (as in the case when the budget is $10K), because there is enough money to re-wire the network into this regular structure.

F. Expanding

We now examine the performance of the algorithms as we expand the data center over time by incrementally adding new servers. We tested two expansion scenarios.

First, we expanded the SCS data center by adding 160 servers at a time, until we had added a total of 640 servers to the data center. The results are shown in Figure 4.


[Figure omitted from this transcript. It plots bisection bandwidth (y-axis, 0–0.40) of fat-tree 1 Gb, fat-tree 10 Gb, LEGUP and REWIRE designs as the data center grows from 1600 to 3200 total servers in steps of 400, with a row reporting each network's diameter.]

Fig. 5: Results of iteratively expanding a greenfield network.

For REWIRE and the greedy algorithm, we used ToR switches with 48 1 Gbps and 4 10 Gbps ports, so each ToR attaches to 40 servers. For the 1 Gb fat-tree, we used ToR switches with 24 1 Gbps ports. The budgets shown in the figure are the cabling budgets given to REWIRE, not the total budget that includes switches.

We found that REWIRE outperforms the fat-tree, the greedy algorithm, and LEGUP in this scenario. The fat-tree is not able to improve the bisection bandwidth of the expanded network beyond the bisection bandwidth of the initial SCS DCN, whereas the greedy algorithm, LEGUP, and REWIRE do. This scenario shows the limitations of the greedy algorithm. After four expansion iterations, REWIRE's network has nearly twice as much bisection bandwidth as the greedy algorithm's network, and LEGUP, despite its topology restrictions, also outperforms the greedy algorithm after the second iteration.

Next, we evaluated REWIRE's performance when constructing, and then expanding, a greenfield data center. In these experiments, we built the initial DCN with a budget of $40K using LEGUP, REWIRE, or a fat-tree with 1 or 10 Gb links. This initial data center contained 1600 servers. Then, we iteratively expanded this data center by adding 400 servers at a time. For each expansion, the algorithms were given a total budget of $60K (this includes the cost of ToR switches). The results are shown in Figure 5.

Again, REWIRE performed better than both fat-tree configurations and LEGUP. Its initial network has more bisection bandwidth than the networks designed by the other approaches, and it maintains this lead as the data center is iteratively expanded.

G. Quantitative results

Time is a bottleneck in our network design algorithm. When using a single CPU core, it can take up to two weeks to converge when the input contains 200 switches. To speed this up, we implemented a GPU-based all-pairs shortest-path (APSP) solver. As the APSP computation is REWIRE's bottleneck operation, we found that doubling the speed of the APSP computations nearly halves REWIRE's total run-time. When compared to a naive APSP solver (implemented following [11]), we found our GPU-based implementation is 28–244x faster, with further speedups for larger networks (full details are in [12]). Further implementation speedups are possible, for example, by partitioning the search space and distributing the computation across multiple machines.

V. OPERATING AN UNSTRUCTURED DCN

Because of their performance benefits, we advocate the adoption of non-regular topologies in the data center. Doing so, however, raises architecture issues. In particular, we need to be able to provide addressing, routing, load-balancing and cost-effective management on unstructured DCN topologies if they are to be of practical use. We now show how previous work can perform these functions.

Addressing and routing: These have been the focus of much recent work due to the difficulty of scaling traditional, distributed control-plane protocols to connect more than a couple thousand servers. Protocols such as SEATTLE [28] and TRILL [40] provide scalable L2 functionality on arbitrary topologies, but do not provide multipath routing, which is needed to fully utilize dense networks. However, we can get scalable L2 networking and multipath routing by using SPAIN [32]. SPAIN uses commodity switches and servers. It modifies end-hosts and uses VLANs to partition the network into path-disjoint subgraphs.

Load-balancing: Our algorithms assume that a network's bandwidth can be fully exploited by the load-balancing mechanism. This assumption is not valid when using single-path protocols like spanning tree; however, near-optimal load-balancing can be achieved on arbitrary topologies by using Multipath TCP [37] or SPAIN [32]. Based on their evaluations, we believe that these multipath protocols would be able to fully utilize REWIRE's constructions. More details are in the full version of this paper [12]. Alternatively, centralized flow controllers like Hedera [2] and Mahout [14] could be modified to provide near-optimal load-balancing on arbitrary topologies.

Management and configuration: Managing a DCN with an irregular topology may be more costly and require more expertise than a vendor-specified DCN architecture. In particular, addressing is more difficult to configure on an irregular topology, because we cannot encode topological locality in the logical ID of a switch (typically a switch's logical ID is its topology-imposed address or label).

To address this issue, we suggest configuring the network using Chen et al.'s generic and automatic data center address configuration system (DAC) [7]. Software-defined networking, such as implemented by OpenFlow, is also a promising solution for arbitrary DCN management (see, e.g., [16], [39]). We believe these solutions can solve many of the management problems that may arise from the introduction of irregular topologies in the data center, and we leave further investigation to future work.


VI. DISCUSSION

The (1 + ε)-approximation algorithm we implemented to compute the bisection bandwidth of a network is numerically unstable. At each iteration of its operation, it performs an all-pairs shortest-path computation. To do this, it needs to compare increasingly minute numbers at each successive iteration. We found that it returns incorrect shortest-path trees after enough iterations, because these comparisons are made on numbers less than 10^-40. Because of this numerical instability, we could not run the approximation algorithm on inputs larger than 200 nodes and 200 edges. Nor could we run it with very small values of ε, because the algorithm performs more iterations as ε decreases.

We believe it would be interesting to incorporate output topology constraints into REWIRE's algorithm. This would allow users to constrain REWIRE's output to a family of topology constructions (e.g., Clos, fat-tree, BCube or HyperX). Doing so would generalize the algorithms of Mudigonda et al. [33], who proposed algorithms to find min-cost constructions for fat-trees and HyperX networks.

We did not explicitly consider designing upgrades or expansions that can be executed with minimal disruption to an existing DCN. However, it is possible to disable REWIRE's support for moving existing network links. The greedy algorithm we compared REWIRE against never moves existing links, so we believe its results are indicative of the results such a modification to REWIRE would have. Another approach is to modify the cost constraints (e.g., moving a link costs five times more than adding a new one) so that only rare, significantly beneficial re-wirings are permissible.

VII. CONCLUSIONS

In this paper, we proposed REWIRE, a framework to design data center networks. REWIRE uses local search to find a network that maximizes bisection bandwidth while minimizing latency and satisfying a large number of user-defined constraints. REWIRE's network design algorithm finds networks that significantly outperform networks found by other proposals. This optimization-based approach is flexible—it can design greenfield networks and upgrades or expansions of existing networks—and effective. For example, REWIRE finds greenfield networks with over 500% more bisection bandwidth and less worst-case end-to-end latency than a fat-tree built with the same budget.

To achieve these results, REWIRE builds unstructured networks, rather than the topology-constrained networks most existing data centers use. REWIRE demonstrates that arbitrary topologies can boost DCN performance while reducing network equipment expenditure. Traditionally, DCN design has been restricted to only a few classes of topologies, because it is difficult to operate an arbitrary topology in a high-performance environment. These difficulties have been mitigated by recent work [7], [32], so it may be time to move away from highly regular DCN topologies because of the performance benefits arbitrary topologies offer.

REFERENCES

[1] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), 2009.
[2] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
[3] M. Alizadeh et al. DCTCP: Efficient packet transport for the commoditized data center. In SIGCOMM, 2010.
[4] T. Benson, A. Akella, and D. Maltz. Network traffic characteristics of data centers in the wild. In IMC, 2010.
[5] T. Benson, A. Anand, A. Akella, and M. Zhang. Understanding data center traffic characteristics. In Proceedings of the 1st ACM Workshop on Research on Enterprise Networking (WREN), 2009.
[6] A. Buluc, J. R. Gilbert, and C. Budak. Solving path problems on the GPU. Parallel Comput., 36:241–253, June 2010.
[7] K. Chen, C. Guo, H. Wu, J. Yuan, Z. Feng, Y. Chen, S. Lu, and W. Wu. Generic and automatic address configuration for data center networks. In SIGCOMM, 2010.
[8] F. R. K. Chung. Spectral Graph Theory. 1994.
[9] Cisco. Cisco data center network architecture and solutions overview. White paper, 2006. Available online (19 pages).
[10] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(5):406–424, 1953.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
[12] A. R. Curtis et al. REWIRE: An optimization-based framework for data center network design. Technical report, University of Waterloo, 2011.
[13] A. R. Curtis, S. Keshav, and A. Lopez-Ortiz. LEGUP: Using heterogeneity to reduce the cost of data center network upgrades. In CoNEXT, 2010.
[14] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-overhead datacenter traffic management using end-host-based elephant detection. In INFOCOM, 2011.
[15] A. R. Curtis and A. Lopez-Ortiz. Capacity provisioning a Valiant load-balanced network. In INFOCOM, 2009.
[16] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee. DevoFlow: Scaling flow management for high-performance networks. In SIGCOMM, 2011.
[17] Digital Realty Trust. What is driving the US market? White paper, 2011. Available online at http://knowledge.digitalrealtytrust.com/.
[18] A. Greenberg et al. VL2: A scalable and flexible data center network. In SIGCOMM, 2009.
[19] A. G. Greenberg, J. R. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: Research problems in data center networks. Computer Communication Review, 39(1):68–73, 2009.
[20] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A high performance, server-centric network architecture for modular data centers. In SIGCOMM, 2009.
[21] J. R. Hamilton. Data center networks are in my way. Presented at the Stanford Clean Slate CTO Summit, 2009.
[22] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29:1170–1183, December 1986.
[23] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 2009.
[24] D. Horn. Stream Reduction Operations for GPGPU Applications, volume 2, chapter 36. 2005.
[25] IBM. IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/.
[26] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[27] S. Kandula, S. Sengupta, A. Greenberg, and P. Patel. The nature of datacenter traffic: Measurements & analysis. In IMC, 2009.
[28] C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: A scalable ethernet architecture for large enterprises. In SIGCOMM, 2008.
[29] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[30] M. Kodialam, T. V. Lakshman, J. B. Orlin, and S. Sengupta. Oblivious routing of highly variable traffic in service overlays and IP backbones. IEEE/ACM Trans. Netw., 17:459–472, April 2009.
[31] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput., 34(10):892–901, 1985.
[32] J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul. SPAIN: COTS data-center ethernet for multipathing over arbitrary topologies. In NSDI, 2010.
[33] J. Mudigonda, P. Yalagandula, and J. C. Mogul. Taming the flying cable monster: A topology design and optimization framework for data-center networks. In USENIX ATC, 2011.
[34] NVIDIA. NVIDIA CUDA. http://www.nvidia.com/object/cuda_home_new.html.
[35] Y. Peres, D. Sotnikov, B. Sudakov, and U. Zwick. All-pairs shortest paths in O(n^2) time with high probability. In FOCS '10, 2010.
[36] L. Popa, S. Ratnasamy, G. Iannaccone, A. Krishnamurthy, and I. Stoica. A cost comparison of data center network architectures. In CoNEXT, 2010.
[37] C. Raiciu et al. Improving datacenter performance and robustness with multipath TCP. In SIGCOMM, 2011.
[38] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. Jellyfish: Networking data centers randomly. In NSDI, 2012.
[39] A. Tavakoli, M. Casado, T. Koponen, and S. Shenker. Applying NOX to the datacenter. In HotNets, 2009.
[40] J. Touch and R. Perlman. Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement. RFC 5556 (Informational), May 2009.
[41] A. Vierra. Case study: NetRiver rethinks data center infrastructure design. Focus Magazine, August 2010. Available online at http://bit.ly/oS4pCu.
