
Decentralized Consistent Updates in SDN

Thanh Dang Nguyen∗, University of Chicago

Marco Chiesa, Université catholique de Louvain

Marco Canini∗, KAUST

ABSTRACT

We present ez-Segway, a decentralized mechanism to consistently and quickly update the network state while preventing forwarding anomalies (loops and black-holes) and avoiding link congestion. In our design, the centralized SDN controller only pre-computes information needed by the switches during the update execution. This information is distributed to the switches, which use partial knowledge and direct message passing to efficiently realize the update. This separation of concerns has the key benefit of improving update performance as the communication and computation bottlenecks at the controller are removed. Our evaluations via network emulations and large-scale simulations demonstrate the efficiency of ez-Segway, which, compared to a centralized approach, improves network update times by up to 45% and 57% at the median and the 99th percentile, respectively. A deployment of a system prototype in a real OpenFlow switch and an implementation in P4 demonstrate the feasibility and low overhead of implementing simple network update functionality within switches.

CCS Concepts

•Networks → Network dynamics; Network protocol design; Network manageability;

1. INTRODUCTION

Updating data plane state to adapt to dynamic conditions is a fundamental operation in all centrally-controlled networks. Performing updates as quickly as possible while preserving certain consistency properties (like loop, black-hole or congestion freedom) is a common challenge faced in many recent SDN systems (e.g., [11, 13, 17]).

∗Work performed at Université catholique de Louvain.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author(s). Copyright is held by the owner/author(s).

SOSR ’17, April 3–4, 2017, Santa Clara, CA, USA

© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4947-5/17/04...$15.00

DOI: http://dx.doi.org/10.1145/3050220.3050224


Network updates are inherently challenging because rule-update operations happen across unsynchronized devices and the consistency properties impose dependencies among operations that must be respected to avoid forwarding anomalies and worsened network performance [6, 14]. Yet, performing updates as fast as possible is paramount in a variety of scenarios ranging from performance to fault tolerance to security [14, 21, 27] and is a crucial requirement for “five-nines” availability of carrier-grade and mobile backhaul networks [23].

Ideally, a network would have the capability to instantaneously update its network-wide state while preserving consistency. Since updates cannot be applied at the exact same instant at all switches, recent work [23] explored the possibility of leveraging clock-synchronization techniques to minimize the transition time. This requires just a single round of communication between the controller and switches. However, with this “all-in-one-shot” update style, packets in flight at the time of the change can violate consistency properties due to the complete lack of coordination [14] or are dropped as a result of scheduling errors due to clock synchronization imprecisions [23].

The bulk of previous work in this area [14, 16, 18, 19, 22, 27, 28] focused on maintaining consistency properties. However, all these approaches require multiple rounds of communications between the controller (which drives the update) and the switches (which behave as remote passive nodes that the controller writes state to and is notified from). This controller-driven update process has four important drawbacks:

First, because the controller is involved with the installation of every rule, the update time is inflated by inherent delays affecting communication between controller and switches. As a result, even with state-of-the-art approaches [14], a network update typically takes seconds to be completed (results show 99th percentiles as high as 4 seconds).

Second, because scheduling updates is computationally expensive [14, 22, 27], the update time is slowed down by a centralized computation.

Third, because the controller can only react to network dynamics (e.g., congestion) at control-plane timescales, the update process cannot quickly adapt to current data plane conditions.

Fourth, in real deployments where the controller is distributed and switches are typically sharded among controllers, network updates require additional coordination overhead among different controllers. For wide-area networks, this additional synchronization can add substantial latency [7].

In this paper, we present ez-Segway, a new mechanism for network updates. Our key insight is to involve the switches as active participants in achieving fast and consistent updates. The controller is responsible for computing the intended network configuration, and for identifying “flow segments” (parts of an update that can be updated independently and in parallel) and dependencies among segments. The update is realized in a decentralized fashion by the switches, which execute a network update by scheduling update operations based on the information received from the controller and messages passed among neighboring switches. This allows every switch to update its local forwarding rules as soon as the update dependencies are met (i.e., when a rule can only be installed after dependent rules are installed at other switches), without any need to coordinate with the controller.

Being decentralized, our approach differs significantly from prior work. It achieves the main benefit of clock-synchronization mechanisms, which incur only a single round of controller-to-switches communication, with the guarantees offered by coordination to avoid forwarding anomalies and congestion. To the best of our knowledge, we are the first to explore the benefits of delegating network updates’ coordination and scheduling functionalities to the switches. Perhaps surprisingly, we find that the required functionality can be readily implemented in existing programmable switches (OpenFlow and P4 [1]) with low overhead.

Preventing deadlocks is an important challenge of our update mechanism (§3). To address this problem, we develop two techniques: (i) flow segmentation, which allows us to update different flow segments independently of one another, thus reducing dependencies, and (ii) splitting volume, which divides a flow’s traffic onto its old and new paths, thus preventing link congestion.

We show that our approach leads to faster network updates, reduces the number of exchanged messages in the network, and has low complexity for the scheduling computation. Compared to state-of-the-art centralized approaches (e.g., Dionysus [14]) configured with the most favorable controller location in the network, ez-Segway improves network update time by up to 45% and 57% at the median and the 99th percentile, respectively, and it reduces message overhead by 65%. To put these results into perspective, consider that Dionysus improved the network update time by up to a factor of 2 over the previous state-of-the-art technique, SWAN [11]. Overall, our results show significant update time reductions, with the speed of light becoming the main limiting factor.

Our contributions are as follows:
• We present ez-Segway (§4), a consistent update scheduling mechanism that runs on switches, initially coordinated by a centralized SDN controller.
• We assess our prototype implementation (§5) by running a comprehensive set of emulations (§6) and simulations (§7) on various topologies and traffic patterns.
• We validate feasibility (§8) by running our system prototype on a real SDN switch and present microbenchmarks that demonstrate low computational overheads.
Our prototype is available as open source at https://github.com/thanh-nguyen-dang/ez-segway.

2. NETWORK UPDATE PROBLEM

We start by formalizing the network update problem and the properties we are interested in. The network consists of switches S={si} and directed links L={`i,j}, in which `i,j connects si to sj with a certain capacity.

Flow modeling. We use a standard model for characterizing flow traffic volumes as in [6, 11, 14]. A flow F is an aggregate of packets between an ingress switch and an egress switch. Every flow is associated with a traffic volume vF. In practice, this volume could be an estimate that the controller computes by periodically gathering switch statistics [6, 14] or based on an allocation of bandwidth resources [6, 13]. The forwarding state of a flow consists of an exact match rule that matches all packets of the flow. As in previous work [14], we assume that flows can be split among paths by means of weighted load balancing schemes such as WCMP or OpenFlow-based approaches like Niagara [15].

Network configurations and updates. A network configuration C is the collection of all forwarding states that determine what packets are forwarded between any pair of switches and how they are forwarded (e.g., match-action flow table rules in OpenFlow). Given two network configurations C, C′, a network update is a process that replaces the current network configuration C by the target one C′.

Properties. We focus on these three properties of network updates: (i) black-hole freedom: no packet is unintendedly dropped in the network; (ii) loop-freedom: no packet should loop in the network; (iii) congestion-freedom: no link should be loaded with traffic greater than its capacity. These properties are the same as the ones studied in [6, 21].

Update operations. Due to link capacity limits and the inherent difficulty in synchronizing the changes at different switches, the link load during an update could get significantly higher than that before or after the update, and all flows of a configuration cannot be moved at the same time. Thus, to minimize disruptions, it is necessary to decompose a network update into a set of update operations Π.

Figure 1: An example network update. Three flows FR, FG, FB, each of size 5, are to be updated to the new configuration C′1. (a) Configuration C1; (b) Configuration C′1.

Update operations must be ordered carefully in order to preserve consistency and avoid congestion.

Intuitively, an update operation π denotes the operation necessary to move a flow F from the old to the new configuration: in the context of a single switch, this refers to the addition or deletion of F’s forwarding state for that switch.

Thus, unlike per-packet consistent updates [28], we allow a flow to be routed through a mix of the old and new configuration, unless other constraints make it impossible (e.g., a service chain of middle-boxes must be visited in reversed order in the two configurations).

Dependency graph. To prevent a violation of our properties, update operations are constrained in the order of their execution. These dependencies can be described with the help of a dependency graph, which will be explained in detail in §4. At any time, only the update operations whose dependencies are met in the dependency graph can be executed. This leads the network through transient intermediate configurations. The update is successful when the network is transformed from the current configuration C to the target configuration C′ such that for all intermediate configurations, the aforementioned three properties are preserved.
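As a minimal illustration (not part of ez-Segway), the model above can be restated in a few lines of Python; all names below are illustrative.

```python
# Minimal sketch of the Section 2 model (illustrative names, not ez-Segway code):
# a configuration maps each flow to the path it is routed on, and congestion-freedom
# requires that no directed link carries more traffic than its capacity.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    name: str
    volume: float                       # traffic volume v_F of the flow

def link_loads(config, flows):
    """config: {flow name: [s_i, s_j, ...]}, the path each flow currently takes."""
    loads = defaultdict(float)
    for f in flows:
        path = config[f.name]
        for u, v in zip(path, path[1:]):
            loads[(u, v)] += f.volume   # traffic placed on directed link l_{u,v}
    return loads

def congestion_free(config, flows, capacity):
    """capacity: {(u, v): capacity of directed link l_{u,v}}."""
    return all(load <= capacity[link]
               for link, load in link_loads(config, flows).items())
```

Black-hole and loop freedom are properties of the forwarding state itself; they are enforced by the update mechanisms of §4 rather than by a capacity check of this kind.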

3. OVERVIEW

Our goal is to improve the performance of updating network state while avoiding forwarding loops, black-holes, and congestion. In contrast with prior approaches – all of which use a controller to plan and coordinate the updates – we explore the question: can the network update problem be solved in a decentralized fashion wherein switches are delegated the task of implementing network updates?

3.1 Decentralizing for fast updates

Consider the example of seven switches s1, ..., s7 shown in Figure 1. Assume each link has 10 units of capacity and there are three flows FR, FG, FB, each of size 5. This means that every link can carry at most 2 flows at the same time. We denote a path through a sequence of nodes s1, ..., sn by (s1 ... sn).

The network configuration needs to be updated from C1 to C′1. Note that we cannot simply transition to C′1 by updating all the switches at the same time. Since switches apply updates at different times, such a strategy can neither ensure congestion freedom nor loop- and black-hole-freedom. For example, if s2 is updated to forward FR on link `2,6 before the forwarding state for FR is installed at s6, this results in a temporary black-hole. Moreover, if s2 forwards FR on link `2,6 before FB is moved to its new path, then `2,6 becomes congested.

Ensuring that the network stays congestion free and that the consistency of forwarding state is not violated requires us to carefully plan the order of update operations across the switches. In particular, we observe that certain operations depend on other operations, leading potentially to long chains of dependencies for non-trivial updates. This fact implies that a network update can be slowed down by two critical factors: (i) the amount of computation necessary to find an appropriate update schedule, and (ii) the inherent latencies affecting communication between controller and switches summed over multiple rounds of communications to respect operation dependencies.

But is it necessary to have the controller involved at every step of a network update? We find that it is possible to achieve fast and consistent updates by minimizing controller involvement and delegating to the switches the tasks of scheduling and coordinating the update process.

We leverage two main insights to achieve fast, consistent updates. Our first insight is that we can complete an update faster by using in-band messaging between switches instead of coordinating the update at the controller, which pays the costs of higher latency. Our second insight is that it is not always necessary to move a flow as a whole. We can complete the update faster by using segmentation, wherein different “segments” of a flow can be updated in parallel. For example, in Figure 1, both flows FR and FB can be decomposed in two independent parts: the first between s2 and s3, and the second between s3 and s4.

Distributed update execution. Before we introduce our approach in detail, we illustrate the execution of a decentralized network update for the example of Figure 1. Initially, the controller sends to every switch a message containing the current configuration C1 and target configuration C′1.¹ This information allows every switch to compute what forwarding state to update (by knowing which flows traverse it and their sizes) as well as when each state update should occur (by obeying operation dependencies while coordinating with other switches via in-band messaging).

In the example, switch s2 infers that link `2,3 has enough capacity to carry FB and that its next-hop switch, s3, is already capable of forwarding FB (because the flow traverses it in both the old and new configuration). Hence, s2 updates its forwarding state so as to move FB from path (s2s6s3) to (s2s3). It then notifies s6 about the update of FB, allowing s6 to safely remove the forwarding state corresponding to this flow.

Once notified by s2, s6 infers that link `6,3 now has available capacity to carry FR. So, it installs the corresponding forwarding state and notifies s2, which is the upstream switch on the new path of FR. Then, s2 infers that the new downstream switch is ready and that `2,6 has enough capacity, so it moves FR to its new path.

¹We revisit later what the controller sends precisely.

Similarly, s3 updates its forwarding state for FR to flow on `3,4, notifying s7 about the update. Meanwhile, switch s7 infers that link `7,4 has enough capacity to carry FB; then, it installs forwarding state for FB and notifies s3. Then s3 moves FB onto `3,7.

Notice that several update operations can run in parallel at multiple switches. However, whenever operations have unsatisfied dependencies, switches must coordinate. In this example, the longest dependency chain involves the three operations that must occur in sequence at s2, s6, and s2 again. Therefore, the above execution accumulates the delay for the initial message from the controller to arrive at the switches plus a round-trip delay between s2 and s6. In contrast, if a centralized controller performed the update following the same schedule, the update time would be affected by the sum of three round-trip delays (two to s2 and one to s6). We aim to replace the expensive communication (in terms of latency) between the switches and the controller with simple and fast coordination messages among neighboring switches.

3.2 Dealing with deadlocks

As in previous work on network update problems, it is worth asking: is it always possible to reach the target configuration while preserving all the desired consistency properties and resource constraints during the transition? Unfortunately, similarly to the centralized case [14], some updates are not feasible: that is, even if the target configuration does not violate any of the three properties, there exists no ordering of update operations to reach the target. For example, in Figure 2, moving first either FB or FR creates congestion.

Assume that a feasible update ordering exists. Then, will a decentralized approach always be able to find it? Unfortunately, without computing a possible schedule in advance [22], inappropriate ordering can lead to deadlocks where no further progress can be made. However, as discussed in [14], computing a feasible schedule is a computationally hard problem.

In practice, even in centralized settings, the current state-of-the-art approach, Dionysus [14], cannot entirely avoid deadlocks. In such cases, Dionysus reduces flow rates to continue an update without violating the consistency properties; however, this comes at the cost of lower network throughput.

Deadlocks pose an important challenge for us as a decentralized approach is potentially more likely to enter deadlocks due to the lack of global information. To deal with deadlocks, we develop two techniques that avoid reducing network throughput (without resorting to rate-limit techniques [14] or reserving some capacities for the update operations [11]).

Our first technique is splitting volume, which divides a flow’s traffic onto its old and new paths to resolve a deadlock. The second technique is segmentation, which allows different flow “segments” to be updated independently of one another. As we discussed, segmentation allows an update to complete faster via parallelization. In this case, it also helps to resolve certain deadlocks. Before presenting our techniques in detail (§4), we illustrate them with intuitive examples.

Figure 2: An update with segment deadlock. Flows FR, FG, FB, each of size 5, move from configuration C2 (a) to configuration C′2 (b).

Figure 3: An update with splittable deadlock. Flows FR, FG, FB, FN, each of size 4, move from configuration C3 (a) to configuration C′3 (b).

Segmentation example. Consider the example in Figure 2, where every flow has size 5. This case presents a deadlock that prevents flows from being updated all at once. In particular, if we first move FR, link `3,4 becomes congested. Similarly, if we first move FB, link `2,3 is then congested.

We resolve this deadlock by segmenting these flows as: FR={s2...s3, s3...s4, s4...s5}, FB={s1...s2, s2...s3, s3...s4}. Then, switches s2 and s6 coordinate to first move segment {s2...s3} of FR followed by the same segment of FB. Independently, switches s3 and s7 move segment {s3...s4} of FB and FR, in this order.

Splitting volume example. Consider the example in Figure 3, where every flow has size 4. This case presents a deadlock because we cannot move FR first without congesting `2,6 or move FB first without congesting `2,3.

We resolve this deadlock by splitting the flows. Switch s2 infers that `2,3 has 2 units of capacity and starts moving the corresponding fraction of FB onto that link. This movement releases sufficient capacity to move FR to `2,6 and `6,3. Once FR is moved, there is enough capacity to complete the move of FB.

Note that we could even speed up part of the update by moving two units of FR simultaneously with moving two units of FB at the very beginning. This is what our decentralized solution would do.

4. EZ-SEGWAY

ez-Segway is a mechanism to update the forwarding state in a fast and consistent manner: it improves update completion time while preventing forwarding anomalies (i.e., black-holes, loops) and avoiding the risk of link congestion.

To achieve fast updates, our design leverages a small number of crucial, yet simple, update functionalities that are performed by the switches. The centralized controller leverages its global visibility into network conditions to compute and transmit once to the switches the information needed to execute an update. The switches then schedule and perform network update operations without interacting with the controller.

Figure 4: Different cases of segmentation. (a) No reversed-order common switches; (b) a pair of reversed-order common switches; (c) two pairs of reversed-order common switches.

As per our problem formulation (§2), we assume the controller has knowledge of the initial and final network configurations C, C′. In practice, each network update might be determined by some application running on the controller (e.g., routing, monitoring, etc.). In our setting, the controller is centralized and we leave it for future work to consider the case of distributed controllers.

Controller responsibilities. In ez-Segway, on every update, the controller performs two tasks: (i) The controller identifies flow segments (to enable parallelization), and it identifies execution dependencies among segments (to guarantee consistency). These dependencies are computed using a fast heuristic that classifies segments into two categories and establishes, for each segment pair, which segment must precede another one. (ii) The controller computes the dependency graph (to avoid congestion), which encodes the dependencies among update operations and the bandwidth requirements for the update operations.

Switch responsibilities. The set of functions at switches encompasses simple message exchange among adjacent switches and a greedy selection of update operations to be executed based on the above information (i.e., the dependencies among segments) provided by the controller. These functions are computationally inexpensive and easy to implement in currently available programmable switches, as we show in Section 8.

Next, we first describe in more detail our segmentation technique, which ez-Segway uses to speed up network updates while guaranteeing anomaly-free forwarding during updates. Then, we describe our scheduling mechanism based on segments of flows, which avoids congestion. Finally, we show how to cast this mechanism into a simple distributed setting that requires minimal computational power in the switches.

4.1 Flow segmentation

Our segmentation technique provides two benefits: it speeds up the update completion time of a flow to its new path and it reduces the risk of update deadlocks due to congested links by allowing a more fine-grained control of the flow update. Segment identification is performed by the controller when a network state update is triggered. We first introduce the idea behind segmentation and then describe our technique in detail.

Update operation in the distributed approach. Consider the update problem represented in Figure 4a, where a flow F needs to be moved from the old (solid-line) path (s0s4 . . . s2s3) to the new (dashed-line) path (s0s6 . . . s2s3). A simple approach would work as follows. A message is sent from s3 to its predecessor on the new path (i.e., s2) for acknowledging the beginning of the flow update. Then, every switch that receives this message forwards it as soon as it installs the new rule for forwarding packets on the new path. Since the message travels in the reverse direction of the new path, the reception of such a message guarantees that each switch on the downstream path has consistently updated its forwarding table to the new path, thus preventing any risk of black-hole or forwarding-loop anomalies. Once the first switch of the new path (i.e., s0) switches to the new path, a new message is sent from s0 towards s3 along the old path for acknowledging that no packets will be forwarded anymore along the old path, which can therefore be safely removed. Every switch that receives this new message removes the old forwarding entry of the old path and afterwards forwards it to its successor on the old path. We call this flow update technique from an old path to the new one Basic-Update. It is easy to observe that Basic-Update prevents forwarding loops and black-holes (proof in the extended version of this paper [26]).

Theorem 1. Basic-Update is black-hole- and forwarding-loop-free.
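The two phases of Basic-Update can be condensed into a short simulation. The sketch below is ours (not the prototype's code): it collapses the message exchange into two loops whose ordering mirrors the protocol, and it assumes the Figure 4a paths are (s0 s4 s1 s5 s2 s3) and (s0 s6 s1 s7 s2 s3), consistent with the subpaths named in the next paragraph.

```python
# Sketch of Basic-Update for one flow F (ours, not the prototype's code). rules[s] is s's
# next hop for F. Phase 1 walks the new path backwards, installing the new next hop before
# the upstream switch is notified; phase 2 walks the old path forwards, removing entries
# that are no longer used. At every intermediate step each switch keeps a valid next hop.
def basic_update(rules, old_path, new_path):
    # Phase 1: "downstream is ready" notification travels from the last switch to the first.
    for i in range(len(new_path) - 2, -1, -1):
        rules[new_path[i]] = new_path[i + 1]      # install (or overwrite) the next hop for F
    # Phase 2: the first switch acknowledges; old-path entries can now be removed.
    for i, s in enumerate(old_path[:-1]):
        if rules.get(s) != old_path[i + 1]:       # entry already replaced by the new path
            continue
        if s not in new_path:                     # switch only on the old path: delete F's entry
            del rules[s]

# Old path (s0 s4 s1 s5 s2 s3) -> new path (s0 s6 s1 s7 s2 s3), as in Figure 4a:
rules = {"s0": "s4", "s4": "s1", "s1": "s5", "s5": "s2", "s2": "s3"}
basic_update(rules, ["s0", "s4", "s1", "s5", "s2", "s3"], ["s0", "s6", "s1", "s7", "s2", "s3"])
print(rules)  # s0->s6, s6->s1, s1->s7, s7->s2, s2->s3 (old-only entries at s4, s5 removed)
```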

Speeding up a flow update with segmentation. It can be observed that the whole flow update operation could be performed faster. With segmentation, the subpath (s0s4s1) of the old path can be updated with Basic-Update to (s0s6s1) while subpath (s1s5s2) of the old path is updated with Basic-Update to (s1s7s2). In fact, s6 does not have to wait for s1 to update its path since s1 is always guaranteed to have a forwarding path towards s3. We say that the pairs (s0s4s1, s0s6s1) and (s1s5s2, s1s7s2) are two “segments” of the flow update. Namely, a segment of a flow update from an old path Po to a new path Pn is a pair of subpaths P′o and P′n of Po and Pn, respectively, that start with the same first switch. We denote a segment by (P′o, P′n). The action of updating a segment consists in performing Basic-Update from P′o to P′n.

Dependencies among segments. In some cases, finding a set of segments that can be updated in parallel is not possible. For instance, in Figure 4b, we need to guarantee that s2 updates its next-hop switch only after s1 has updated its own next-hop; otherwise, a forwarding loop along (s2, s1, s5) arises. While one option would be to update the whole flow using Basic-Update along the reversed new path, some parallelization is still possible. Indeed, we can update segments S1=(s0s4s1, s0s8s2) and S2=(s1s5s2, s1s7s3) in parallel without any risk of creating a forwarding loop. In fact, s2, which is a critical switch, is not yet updated and the risk of forwarding oscillation is avoided. After S2 is updated, also segment S3=(s2s6s3, s2s1) can be updated. Every time two switches appear in reversed order in the old and new path, one of the two switches has to wait until the other switch completes its update.

Anomaly-free segment update heuristic. It would be tempting to create as many segments as possible so that the update operation could benefit the most from parallelization. However, if the chain of dependencies among the segments is too long, the updates may be unnecessarily slowed down. In practice, computing a maximal set of segments that minimizes the chain of dependencies among them is not an easy task. We rely on a heuristic that classifies segments into two categories called InLoop and NotInLoop and a dependency mapping dep : InLoop→NotInLoop that assigns each InLoop segment to a NotInLoop segment that must be executed before its corresponding InLoop segment to avoid a forwarding loop. This guarantees that the longest chain of dependencies is limited to at most 2. We call this technique Segmented-Update.

Our heuristic works as follows. It identifies the set of common switches SC among the old and new path, denotes by P the pairs of switches in SC that appear in reversed order in the two paths, and sets SR to be the set of switches that appear in SC. It selects a subset PR of P that has the following properties: (i) for any two pairs of switches (r, s) and (r′, s′) in PR, neither r′ nor s′ is contained in the subpath from r to s in the old path; (ii) every switch belonging to a pair in SC is contained in at least one subpath from r to s for a pair (r, s) of PR; and (iii) the number of pairs in PR is minimized. Intuitively, each pair (s1, s2) of PR represents a pair of switches that may cause a forwarding loop unless s2 is updated after s1. The complexity for computing these pairs of switches is O(N), where N is the number of switches in the network.

The segments are then created as follows. Let S1R (S2R) be the set of switches that appear as the first (second) element in at least one pair of PR. Let S∗C ⊆ SC be the set of common vertices that are contained in neither S1R nor S2R. For each switch s in S∗C∪S1R (S2R), we create a NotInLoop (InLoop) segment S=(Po, Pn), where Po (Pn) is the subpath of the old (new) path from s to the next switch r in S∗C ∪ S1R ∪ S2R. Moreover, if S is an InLoop segment, we set dep(S) to be Sr, where Sr is the segment starting from r. The following theorem (proof in the extended version of this paper [26]) guarantees that Segmented-Update does not create black-holes or forwarding loops.

Theorem 2. Segmented-Update is black-hole- and forwarding-loop-free.

As an example, consider the update depicted in Figure 4c. All the switches in common between the old and new path are colored in red or blue. Let PR be {(s1, s3), (s4, s5)} (colored in red), for which properties (i), (ii), and (iii) all hold. The resulting NotInLoop segments are S1=(s1s2s3, s1s9s5), S2=(s4s5, s4s6), S3=(s0s1, s0s7s3), the InLoop segments are S4=(s3s4, s3s2s8s1), S5=(s5s6, s5s10s4), while the dependencies dep(S4) and dep(S5) are S1 and S2, respectively.

Segmentation and middleboxes. If the network policy dictates that a specific flow of data traffic must traverse one or more middleboxes, segmenting a flow must be carefully performed as the combination of fragments from the old and the new path may not traverse the set of middleboxes. In these cases, segmentation can only be applied if there are no InLoop segments.
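The reversed-order condition at the heart of this heuristic can be illustrated with a brute-force check. The snippet below is only an illustration of the condition; it does not implement the O(N) selection of the minimal subset PR described above, and the example paths are reconstructed from the Figure 4b subpaths quoted earlier, so they should be read as indicative.

```python
# Brute-force illustration of the reversed-order pairs P (not the O(N) heuristic that
# selects the minimal subset P_R): a pair (r, s) is returned when r precedes s on the old
# path but s precedes r on the new path, i.e., updating them independently risks a loop.
def reversed_order_pairs(old_path, new_path):
    common = [s for s in old_path if s in new_path]       # S_C, in old-path order
    pos_new = {s: i for i, s in enumerate(new_path)}
    pairs = []
    for i, r in enumerate(common):
        for s in common[i + 1:]:
            if pos_new[s] < pos_new[r]:
                pairs.append((r, s))                      # s must wait until r has updated
    return pairs

# Paths reconstructed from the Figure 4b subpaths quoted above (indicative only):
old = ["s0", "s4", "s1", "s5", "s2", "s6", "s3"]
new = ["s0", "s8", "s2", "s1", "s7", "s3"]
print(reversed_order_pairs(old, new))  # [('s1', 's2')]: s2 must update only after s1
```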

4.2 Scheduling updates

Performing a network update in a distributed manner requires coordination among the switches. As we already showed in Section 4.1, segments’ updates that can cause forwarding loops must be executed according to a partial order of all the update operations.

We now consider the problem of performing a network update without congesting any link in the network. The main challenge of this procedure is how to avoid deadlock states in which any update operation would create congestion. We propose a heuristic that effectively reduces the risk of deadlocks, which we validate experimentally in our evaluation (Section 6). Our heuristic is correct, i.e., it finds a congestion-free migration only if the network update problem admits a congestion-free solution.

Figure 5 shows a case in which an inappropriate schedule of network operations leads to a deadlock situation. FR and FB have size 4 while FG and FN have size 3. The only operations that can be performed without congesting a link are updating either FR or FG. Which one to choose first? Choosing FG would not allow us to update FB since FG releases only 3 units of free capacity to `2,3. This leads to a deadlock that can only be solved via splitting mechanisms, which increases the completion time of the update. Instead, choosing to update FR releases 4 units of capacity to `2,3, which allows us to update FB. In turn, by updating FB we have enough capacity to move FG to its new path, and the update completes successfully.

In the previous example, we noted that splitting volumes helps avoid deadlocks; however, this is less preferable than finding a sequence of update operations that does not split flows, for two main reasons: (i) splitting flows requires more time to complete an update, and (ii) the physical hardware may lack support for splitting flows based on arbitrary weights. In the latter case, we stress the fact that ez-Segway can be configured to disable flow splitting during network updates.

Figure 5: Choosing a correct update operation. (a) Configuration C4; (b) configuration C′4; (c) the dependency graph over πR, πG, πB and links `6,3:6, `2,6:6, `2,3:0; (d) flow sizes FR: 4, FG: 3, FB: 4, FN: 3.

Considering again the previous example, we observe that there are two types of deadlocks. If a network update execution reaches a state in which it is not possible to make any progress except by splitting a flow, we say that there is a splittable deadlock in the system. Otherwise, if even splitting flows cannot make any progress, we say that there is an unsplittable deadlock in the network.

Even from a centralized perspective, computing a congestion-free migration is not an easy task [2]. Our approach employs a mechanism to centrally pre-compute a dependency graph between links and segment updates that can be used to infer which updates are more critical than others for avoiding deadlocks.

The dependency graph. The dependency graph captures the complex set of dependencies between the network update operations and the available link capacities in the network. Given a pair of current and target configurations C, C′, any execution of a network operation π (i.e., updating part of a flow to its new path) requires some link capacity from the links on the new path and releases some link capacity on the old path. We formalize these dependencies in the dependency graph, which is a bipartite graph G(Π, L, Efree, Ereq), where the two subsets of vertices Π and L represent the update operation set and the link set, respectively. Each link vertex `i,j is assigned a value representing its current available capacity. Sets Efree and Ereq are defined as follows:
• Efree is the set of directed edges from vertices in Π to vertices in L. A weighted edge with value v from π to a link `i,j represents the increase of available capacity of v units at `i,j by performing π.
• Ereq is the set of directed edges from vertices in L to vertices in Π. A weighted edge with value v from link `i,j to π represents the available capacity at `i,j that is needed to execute π.

Consider the example update of Figure 5(b) and (d), where flows FR, FG, and FB change paths. The corresponding dependency graph is shown in Figure 5(c). In the graph, circular nodes denote operations and rectangular nodes denote link capacities. Efree edges are shown in black and Ereq edges are shown in red; the weight annotations reflect the amount of increased and requested available capacity, respectively. For instance, πR requires 4 units of capacity from `2,6 and `6,3, while it increases the available capacity of `2,3 by the same amount.

Figure 6: Dependency graph of deadlock cases. (a) Splittable deadlock; (b) the deadlock example in Fig. 2.

Figure 7: Dependency graph of deadlock cases after segmentation. (a) s2 with segmentation; (b) s3 with segmentation.
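The dependency graph for this example can be written down directly. The following sketch uses plain Python dictionaries rather than the paper's data structures; πR's edges follow the description above, while the link sets of πG and πB are filled in to match the narrative around Figure 5 and should be read as illustrative.

```python
# Sketch of the bipartite dependency graph G(Pi, L, E_free, E_req) for Figure 5
# (our rendering; pi_G's and pi_B's link sets are inferred from the surrounding text).
def build_dependency_graph(operations):
    """operations: {pi: (volume, old_path_links, new_path_links)}."""
    e_free, e_req = [], []
    for pi, (vol, old_links, new_links) in operations.items():
        for link in old_links:
            e_free.append((pi, link, vol))   # executing pi frees vol units on each old-path link
        for link in new_links:
            e_req.append((link, pi, vol))    # pi needs vol units of headroom on each new-path link
    return e_free, e_req

def executable(operations, free):
    """Operations whose required headroom is currently available on every new-path link."""
    return [pi for pi, (vol, _old, new_links) in operations.items()
            if all(free[l] >= vol for l in new_links)]

ops = {
    "pi_R": (4, ["l_2,3"], ["l_2,6", "l_6,3"]),   # FR moves off l_2,3 onto l_2,6 and l_6,3
    "pi_G": (3, ["l_2,3"], ["l_2,6", "l_6,3"]),   # FG moves the same way, but frees only 3 units
    "pi_B": (4, ["l_2,6", "l_6,3"], ["l_2,3"]),   # FB moves in the opposite direction
}
free = {"l_2,3": 0, "l_2,6": 6, "l_6,3": 6}       # initial residual capacities from Figure 5(c)
e_free, e_req = build_dependency_graph(ops)
print(executable(ops, free))  # ['pi_R', 'pi_G'], matching the discussion of Figure 5
```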

We now explore the dependency graph with respect to the two techniques that we introduced for solving deadlocks: segmentation and splitting flow volumes.

Deadlocks solvable by segmentation. Coming back to the example given in Section 3, we show its dependency graph in Figure 6b. This deadlock is unsolvable by the splitting technique. However, as discussed before, if we allow a packet to be carried on a mix of the old and the new path of the same flow, this kind of deadlock is solvable by using segmentation.

ez-Segway decomposes this deadlocked graph into two non-deadlocked dependency graphs as shown in Figures 7a and 7b, hence enabling the network to be updated.

Splittable deadlock. Assume that in our example we execute the update operation πG before πR. After FG is moved to the new path, the dependency graph becomes the one in Figure 6a. In this case, every link has 3 units of capacity but it is impossible to continue the update. However, if we allow the traffic of flows FR and FB to be carried on both the old path and the new path at the same time, we can move 3 units of FR and FB to the new path and continue the update, which enables updating the remaining part of the flows. The deadlock would not be splittable if the capacity of the relevant links was zero, as shown in Figure 6b.

In the presence of a splittable deadlock, there exists a splittable flow Fp and there is a switch s in the new segment of Fp. Switch s detects the deadlock and determines the amount of Fp’s volume that can be split onto the new segment. This is taken as the minimum of the available capacity on s’s outgoing link and the necessary free capacity for the link in the dependency cycle to enable another update operation at s. An unsplittable deadlock corresponds to the state in which there is a cycle in the dependency graph where each link has zero residual capacity and it is not possible to release any capacity from the links in the cycle.

Congestion-free heuristic. Having categorized the space of possible deadlocks, we now introduce our scheduling heuristic, called ez-Schedule, whose goal is to perform a congestion-free update as fast as possible. The main goal is to avoid both unsplittable deadlocks, which can only be solved by violating congestion-freedom, and splittable deadlocks, which require more iterations to perform an update since flows are not moved in one single phase.

ez-Schedule works as follows. It receives as input an instance of the network update problem where each flow is already decomposed into segments. Each segmented flow is updated with Segmented-Update, which means that each segment is updated directly from its old path to the new one if there is enough capacity. Hence, each segment corresponds to a network update node in the dependency graph of the input instance. Each switch assigns to every segment that contains the switch in its new path a priority level based on the following key structure in the dependency graph. An update node π in the dependency graph is critical at a switch s if (i) s is the first switch of the segment to be updated, and (ii) executing π frees some capacity that can directly be used to execute another update node operation that would otherwise not be executable (i.e., even if every other update node operation could possibly be executed). A critical cycle is a cycle that contains a critical update node.

ez-Schedule assigns low priority to all the segments (i.e., network update nodes) that do not belong to any cycle in the dependency graph. These update operations consume useful resources that are needed in order to avoid splittable deadlocks and, even worse, unsplittable deadlocks, which correspond to the presence of cycles with zero residual capacities in the dependency graph, as previously described. We assign medium priority to all the remaining segments that belong only to non-critical cycles, while we assign higher priority to all the updates that belong to at least one critical cycle. This guarantees that updates belonging to a critical cycle are executed as soon as possible so that the risk of incurring a splittable or unsplittable deadlock is reduced. Each switch schedules its network operations as follows. It only considers segments that need to be routed through its outgoing links. Among them, segment update operations with lower priority should not be executed before operations with higher priority unless a switch detects that there is enough bandwidth for executing a lower-priority update operation without undermining the possibility of executing higher-priority updates when they become executable. We run a simple Breadth-First-Search (BFS) tree rooted at each update operation node to determine which update operations belong to at least one critical cycle. In addition to the priority values of each flow, the updates must satisfy the constraints imposed by Segmented-Update, if there are any (i.e., there is at least one InLoop segment).

We can prove that ez-Schedule is correct (proof in the extended version of this paper [26]), i.e., as long as there is an executable network update operation there is no congestion in the network, and that the worst-case complexity for identifying a critical cycle for a given update operation is O(|Π|+|L|+|Π|×|L|) ≈ O(|Π|×|L|). Consequently, for all update operations the complexity is O(|Π|²×|L|).

Theorem 3. ez-Schedule is correct for the network update problem.
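To make the priority rule concrete, the sketch below classifies update nodes into the three levels in a deliberately simplified way: it marks a node critical when it frees a link that some currently blocked operation requires (ignoring the per-switch condition and the exact amount freed), and it gives high priority to critical nodes that lie on a cycle rather than to every node of a critical cycle. All names are illustrative; this is not the algorithm used in ez-Segway.

```python
# Simplified sketch of ez-Schedule's three priority levels (an approximation of the rules
# above, not the paper's algorithm). A node is "critical" here if it frees a link that a
# currently blocked operation requires; high priority goes to critical nodes on a cycle.
from collections import defaultdict

def priorities(ops, free):
    """ops: {pi: (volume, old_links, new_links)}; free: {link: available capacity}."""
    succ = defaultdict(set)                      # directed bipartite dependency graph
    for pi, (vol, old_links, new_links) in ops.items():
        for l in old_links:
            succ[pi].add(l)                      # E_free edge: pi -> link it frees
        for l in new_links:
            succ[l].add(pi)                      # E_req edge: link -> pi that requires it

    def reachable(src, dst):                     # DFS reachability over the dependency graph
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n])
        return False

    on_cycle = {pi for pi in ops if any(reachable(s, pi) for s in succ[pi])}
    blocked = {pi for pi, (v, _o, new) in ops.items() if any(free[l] < v for l in new)}
    critical = {pi for pi in ops
                if any(succ[l] & blocked for l in ops[pi][1])}   # frees a link a blocked op needs
    return {pi: ("low" if pi not in on_cycle else
                 "high" if pi in critical else "medium") for pi in ops}
```

With the Figure 5 operations from the earlier sketch, this returns high for πR and πG and medium for πB; the paper's stricter definition also checks the amount of capacity freed, which would demote πG.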

4.3 Distributed coordination

We now describe the mechanics of coordination during network updates.

First phase: the centralized computation. As described in Sect. 4.1, to avoid black-holes and forwarding loops, each segment can be updated using a Basic-Update technique unless there are InLoop segments. These dependencies are computed by the controller in the initialization phase and transmitted to each switch in order to coordinate the network update correctly.

As described in Sect. 4.2, the scheduling mechanism assigns one out of three priority levels to each segment update operation. The centralized controller is responsible for performing this computation and sending to each switch an InstallUpdate message that encodes the segment update operations (and their dependencies) that must be handled by the switch itself, in addition to the priority level of the segment update operations.

Second phase: the distributed computation. Each switch s receives from the centralized controller the following information regarding each single segment S that traverses s in the new path: its identifier, its priority level (i.e., high, medium, or low), its amount of traffic volume, the identifier of the switch that precedes (succeeds) it along the new and old path, and whether the receiving switch is the initial or final switch in the new path of S and in the old and new path of the flow that contains S as a segment. If the segment is of type InLoop, the identifier of the segment that has to be updated before this one is also provided. Each switch in addition knows the capacity of its outgoing links and maintains memory of the amount of capacity that is used by the flows at each moment in time.
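The per-segment state listed above maps naturally onto a small record. The dataclass below is a sketch with field names of our own choosing; it is not the wire format of the InstallUpdate message used by the prototype.

```python
# Sketch of the per-segment information carried by InstallUpdate (field names are
# illustrative, not the prototype's wire format).
from dataclasses import dataclass
from typing import Optional

@dataclass
class SegmentInfo:
    segment_id: int
    priority: str                      # "high" | "medium" | "low"
    volume: float                      # traffic volume to move
    new_prev: Optional[str]            # predecessor on the new path (None at the first switch)
    new_next: Optional[str]            # successor on the new path (None at the last switch)
    old_prev: Optional[str]            # predecessor on the old path
    old_next: Optional[str]            # successor on the old path
    first_on_new_path: bool            # is this switch the initial switch of S's new path?
    last_on_new_path: bool             # is this switch the final switch of S's new path?
    flow_endpoint: bool                # is it an endpoint of the containing flow's old/new paths?
    depends_on: Optional[int] = None   # for InLoop segments: the segment to be updated first
```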

Upon receiving this information from the controller, each switch performs the following initial actions for each segment S that it received: it installs the new path and it removes the old path. The messages exchanged by the switches for performing these operations are described in detail in the next paragraphs. We want to stress the fact that all the functionalities executed in the switches consist of simple message exchanges and basic ranking of update operations that are computationally inexpensive and easy to implement in currently available programmable switches. Those operations are similar to those performed by MPLS to install labels in the switches [29]. Yet, MPLS does not provide any mechanism to schedule the path installation in a congestion-free manner.

Installing the new path of a segment. The installation of the new path is performed by iteratively reserving, along the reversed new path, the bandwidth required by the new flow. The last switch on the new path of segment S sends a GoodToMove message to its predecessor along the new path. The goal of this message is to inform the receiver that the downstream path is set up to receive the traffic volume of S. Upon receiving a GoodToMove message for a segment S, a switch checks if there is enough bandwidth on the outgoing link to execute the update operation. If not, it waits until enough bandwidth becomes available as flows are removed from the outgoing link. If there is enough bandwidth, it checks if there are segment update operations that require the outgoing link and have higher priority than S. If not, the switch executes the update operation. Otherwise, it checks whether the residual capacity of the link minus the traffic volume of the segment is enough to execute all the higher-priority update operations in the future; in that case, the switch executes the update operation. If the switch successfully performs the update operation, it updates the residual capacity of the outgoing link and it sends a GoodToMove message to its predecessor along the new path of S. If the switch has no predecessor along the new path of S, i.e., it is the first switch of the new path of S, it sends a Removing message to its successor in the old path. If the receiving switch is the last switch of an InLoop segment S′, it sends a GoodToMove message to its predecessor on dep(S′).

Removing the old path of a segment. Upon receiving a Removing message for a segment S, if the receiving switch is not a switch in common with the new path that has not yet installed the new flow entry, it removes the entry for the old path and it forwards the message to its successor in the old path. Otherwise, it puts the Removing message on hold until it installs the new path. If the switch removed the old path, it updates the capacity of its outgoing links and checks whether there was a dependency between the segment that was removed and any segment that can be executed at the receiving switch. In that case, it executes these update operations according to their priorities and the residual capacity (as explained above) and propagates the GoodToMove messages that were put on hold.
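The GoodToMove handling just described condenses into a single decision function. The sketch below is our reading of the text: the switch object and its helpers (outgoing_link, pending_on, install_new_entry, send, defer) are hypothetical placeholders rather than the prototype's API, priorities are assumed to be numeric with larger meaning more urgent, and the Removing handler and the InLoop notifications are elided.

```python
# Condensed sketch of GoodToMove handling (our reading of the description above; `sw` and
# its helper methods are hypothetical, not the prototype's API).
def on_good_to_move(sw, seg):
    link = sw.outgoing_link(seg)                       # outgoing link toward seg's new next hop
    if sw.residual[link] < seg.volume:
        sw.defer(seg)                                  # wait: a later Removing will free capacity
        return
    higher = [s for s in sw.pending_on(link) if s.priority > seg.priority]
    # Execute only if no higher-priority update needs this link, or if enough headroom
    # would remain for all of them after moving seg's volume.
    if higher and sw.residual[link] - seg.volume < sum(s.volume for s in higher):
        sw.defer(seg)
        return
    sw.install_new_entry(seg)                          # perform the update operation
    sw.residual[link] -= seg.volume
    if seg.new_prev is not None:
        sw.send(seg.new_prev, "GoodToMove", seg)       # keep propagating upstream on the new path
    else:
        sw.send(seg.old_next, "Removing", seg)         # first switch: start removing the old path
```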

4.4 Dealing with failures

While the network update process must deal with link and switch failures, we argue that doing so from within the switches simply introduces unwarranted complexity. Thus, we deliberately avoid dealing with failures in our distributed coordination update mechanism.

As with a centralized approach, if a switch or link fails during an update, a new target configuration must be computed. Recall that ez-Segway is responsible for updating from an initial configuration to the final one but not for computing them. We believe that the controller is the best place to re-compute a new global desired state and start a new update. Note that in the presence of a switch or link failure, our update process stops at some intermediate state. Once the controller is notified of the failure, it halts the updates, queries the switches to learn which update operations were performed, and uses this information to reconstruct the current network state and compute the new desired one.

This process can be optimized to minimize recovery delay. A natural optimization is to have switches asynchronously send a notification to the controller upon performing an update operation, thus enabling the controller to keep closer in sync with the data plane state.

As for the messages between switches, we posit that these packets are sent with the highest priority so that they are not dropped due to congestion, and that their transmission is retried if a timeout expires before an acknowledgment is received. When a message is not delivered after a maximum number of retries, we behave as though the link has failed.

5. IMPLEMENTATION

We implemented an unoptimized ez-Segway prototype in 6.5K lines of Python code. The prototype consists of a global controller that runs centrally as a single instance, and a local controller that is instantiated on every switch. The local controller is built on top of the Ryu controller [31]. The local controller, which executes our ez-Schedule algorithm, connects to and manipulates the data-plane state via OpenFlow, and sends/receives UDP messages to/from other switches. Moreover, the local controller communicates with the global controller to receive the pre-computed scheduling information messages and to send back an acknowledgment once an update operation completes.

6. PROTOTYPE EVALUATION

We now evaluate the update time performance of our prototype through emulation in Mininet [10]. We compare ez-Segway with a Centralized approach, inspired by Dionysus [14], that runs the same scheduling algorithm as ez-Segway but delegates coordination among the switches to a central controller. To make the comparison fair, we empower Centralized with segmentation and splitting volume support, both of which do not exist in Dionysus.

Experimental setup. We run all our experiments on a dedicated server with 16 cores at 2.60 GHz with hyper-threading, 128 GB of RAM and Ubuntu Linux 14.04. The computation in the central controller is parallelized across all cores. Deadlocks are detected by means of a timeout, which we set to 150 ms.

We consider two real WAN topologies: B4 [13] – the globally-deployed software-defined WAN at Google – and the layer 3 topology of the Internet2 network [12]. Without loss of generality, we assume link capacities of 1 Gbps. We place the controller at the centroid switch of the network, i.e., the switch that minimizes the maximum latency towards the farthest switch. We compute the latency of a link based on the geographical distance between its endpoints and the signal propagation speed through optical fibers (i.e., ∼200,000 km/s).

Figure 8: Update time of ez-Segway versus a Centralized approach for B4 and Internet2 (50th, 90th, and 99th percentiles, broken down into computation and coordination time).

Similarly to the methodology in [8], we generate all flows of a network configuration by selecting non-adjacent source and destination switch pairs at random and assigning them a traffic volume generated according to the gravity model [30]. For a given source-destination pair (s, d), we compute 3 paths from s to d and equally spread the (s, d) traffic volume among these 3 paths. To compute each path, we first select a third transit node t at random and then we compute a cycle-free shortest path from s to d that traverses t. If it does not exist, we choose a different random transit node. We guarantee that the latency of the chosen paths is at most a factor of 1.5 greater than the latency of the source-destination shortest path. Starting from an empty network configuration that does not have any flow, we generate 1,000 continuous network configurations. If the volumes of the flows in a configuration exceed the capacity of at least one link, we iteratively remove flows until the remaining ones fit within the given link capacities. The resulting network updates consist of a modification of at least (at most) 176 and 188 (200 and 276) flows in B4 and Internet2, respectively. The experiment goes through these 1,000 sequential update configurations and we measure both the time for completing each network update and each individual flow update. The network update completion time corresponds to the slowest flow completion time among the flows that were scheduled in the update.
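The path-selection step just described can be sketched as follows. This is our reconstruction of the methodology, assuming a networkx graph whose edges carry a 'latency' attribute; it is not the code used in the experiments.

```python
# Sketch of the 3-path selection via random transit nodes (our reconstruction of the
# methodology above; assumes a networkx graph G with a 'latency' edge attribute).
import random
import networkx as nx

def pick_paths(G, s, d, k=3, stretch=1.5, max_tries=100):
    base = nx.shortest_path_length(G, s, d, weight="latency")   # latency of the direct shortest path
    paths = []
    for _ in range(max_tries):
        if len(paths) == k:
            break
        t = random.choice([n for n in G.nodes if n not in (s, d)])
        p = (nx.shortest_path(G, s, t, weight="latency")
             + nx.shortest_path(G, t, d, weight="latency")[1:])
        lat = sum(G[u][v]["latency"] for u, v in zip(p, p[1:]))
        if len(set(p)) == len(p) and lat <= stretch * base:     # cycle-free and within 1.5x stretch
            paths.append(p)
    return paths    # the (s, d) traffic volume is then split equally across these paths
```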

We break down the update time into the computation time at the central controller for scheduling the network updates and the coordination time among the switches to perform the network update. The coordination time is measured as the interval between the time when the first controller-to-switch message is sent and the time when the last switch notifies the controller that it has completed all update operations. We report the 50th, 90th, and 99th percentile update time across all the update configurations.

ez-Segway outperforms Centralized. Figure 8 shows the results for the update completion time. For B4, ez-Segway is 45% faster than Centralized in the median case and 57% faster for the 99th percentile of updates, a crucial performance gain for the scenarios considered. We observed a similar trend for I2, where the completion time is 38% shorter than Centralized at the 90th percentile. It should be noted that we have adopted almost worst-case conditions for ez-Segway in this comparison in that the central controller is ideally placed at the network centroid. Importantly, our approach is able to complete all the network updates, despite overcoming 203 and 68 splittable deadlocks in B4 and Internet2, respectively. Also, ez-Segway exchanges 35% of the messages sent by Centralized.

Controller-to-switch communication is the Centralized bottleneck. We note that the computation time at the controller represents a small fraction of the overall update time (i.e., ≤ 20%), showing that the main bottleneck of a network update is the coordination time among the network devices, where ez-Segway significantly outperforms Centralized. As we show later with micro-benchmarks on a real switch in §8, the computation within the switch is in the order of 3−4 ms. This means that our approach is essentially limited by the propagation latency of the speed of light, which cannot be improved with better scheduling algorithms.

ez-Segway speeds up the slowest flows. Figure 9 shows the distribution of the update completion time of individual flows across all network updates. We observe that ez-Segway (solid lines) not only consistently outperforms the centralized approach (dashed lines) in both B4 and Internet2, but it also reduces the long tail visible in Centralized, a fundamental improvement to the network’s responsiveness and recovery after failures. We observed that the maximum flow update time is 456 ms in ez-Segway and 1634 ms in Centralized for B4, and 350 ms in ez-Segway and 673 ms in Centralized for Internet2. The mean flow update time is 26% and 51% slower with Centralized for B4 and Internet2, respectively.

Figure 9: Flow update time of ez-Segway versus a Centralized approach for B4 and Internet2 (CDF of per-flow update time in ms).

Table 1: Overheads for flow splitting operations.

Overhead   Mean            Max.
Rule       1.37 (±0.57)    4
Message    3.16 (±3.62)    41

ez-Segway overcomes deadlocks with low overheads. We now measure the cost of resolving a deadlock by splitting flows in terms of forwarding table overhead and message exchange overhead in the same experimental setting used so far.

Figure 10: Update time of ez-Segway versus Centralized for various simulation settings (update time [ms] for Abovenet, Ebone, Exodus, Sprint, Telstra, and Tiscali with 25% and 75% of flows rerouted).

Splitting flows increases the number of forwarding entries, since both the new and the old flow entries must be active at the same time in order to split a flow between the two. Moreover, splitting flows requires coordination among switches, which adds message communication overhead. We compare ez-Segway against a version of ez-Segway that does not split flows and instead completes an update by congesting links. Table 1 shows that the average (maximum) number of additional forwarding entries installed in the switches is 1.37 (4). As for the message exchange overhead, we observe that the average (maximum) number of additional messages in the entire network is 3.16 (41). We conclude that these overheads are negligible.

7. LARGE-SCALE SIMULATIONS

We turn to simulations to study the performance of our approach on large-scale Internet topologies. We compare ez-Segway against the Centralized Dionysus-like approach, which we defined in §6. We measure the total update time to install updates on 6 real topologies annotated with link delays and weights, as inferred by the RocketFuel engine [20]. We set link capacities between 1 and 100 Gbps, inversely proportional to weights. We place the controller at the centroid switch.
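As an illustration of this setup, the sketch below (ours, using networkx) assigns capacities and picks the controller location; the linear rescaling of inverse weights into the 1-100 Gbps range and the definition of the centroid as the switch with minimum average shortest-path delay are our interpretations of the description above.

```python
import networkx as nx

def assign_capacities(g, lo=1.0, hi=100.0):
    """Set each link's capacity (Gbps) inversely proportional to its RocketFuel
    weight, linearly rescaled into [lo, hi]."""
    inverse = {(u, v): 1.0 / d["weight"] for u, v, d in g.edges(data=True)}
    lo_inv, hi_inv = min(inverse.values()), max(inverse.values())
    for (u, v), x in inverse.items():
        frac = (x - lo_inv) / (hi_inv - lo_inv) if hi_inv > lo_inv else 1.0
        g[u][v]["capacity"] = lo + frac * (hi - lo)

def centroid_switch(g, weight="delay"):
    """Return the switch minimizing the average shortest-path delay to all
    other switches (where we place the controller)."""
    dist = dict(nx.all_pairs_dijkstra_path_length(g, weight=weight))
    return min(g.nodes, key=lambda n: sum(dist[n].values()))
```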

We generate flows using the same approach as in our prototype experiments. We run simulations for a number of flows varying from 500 to 1,500 and report results for 1,000 flows, as we observed qualitatively similar behaviors. Since the absolute values of network update time strongly depend on the number of flows, we focus on the performance gain factor. We generate updates by simulating link failures that cause a certain percentage p of flows to be rerouted along new shortest paths. We ran experiments for 10%, 25%, 50%, and 75%; for brevity, we report results for 25% and 75%. For every setting of topology, flows, and failure rate, we generate 10 different pairs of old and new network configurations, and report the average update completion time and its standard deviation. Figure 10 shows our results, which demonstrate that ez-Segway reduces the update completion time by a factor of 1.5–2. In practice, the ultimate gains of using ez-Segway will depend on specific deployment scenarios (e.g., WAN, datacenter) and might not be significant in absolute terms in some cases.
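A sketch of how such an old/new configuration pair can be produced follows; this is our own reading of the procedure, assuming an undirected connected graph and a `flows` map from each source-destination pair to its current path.

```python
import random
import networkx as nx

def fail_and_reroute(g, flows, p, weight="delay"):
    """Fail random links (keeping the topology connected) until roughly a
    fraction p of flows has lost its path, and reroute those flows onto new
    shortest paths in the degraded topology."""
    degraded = g.copy()
    new_flows = dict(flows)
    rerouted = set()
    edges = list(degraded.edges())
    random.shuffle(edges)
    for u, v in edges:
        if len(rerouted) >= p * len(flows):
            break
        attrs = dict(degraded[u][v])
        degraded.remove_edge(u, v)
        if not nx.is_connected(degraded):
            degraded.add_edge(u, v, **attrs)   # skip failures that partition the network
            continue
        for (s, d), path in new_flows.items():
            hops = set(zip(path, path[1:]))
            if (u, v) in hops or (v, u) in hops:
                new_flows[(s, d)] = nx.shortest_path(degraded, s, d, weight=weight)
                rerouted.add((s, d))
    return new_flows
```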

8. HARDWARE FEASIBILITY

OpenFlow. To assess the feasibility of deploying ez-Segway on real programmable switches, we tested our prototype on a Centec V580 switch [4], which runs a low-power 528 MHz PowerPC core with 1 GB of RAM, wherein we execute the ez-Segway local controller. Our code runs in Python and is not optimized. We observe that executing the local controller in the switch required minimal configuration effort, as the code remained unchanged.

We first run a micro-benchmark to test the performance of the switch for running the scheduling computation. Based on Internet2 topology tests, we extract a sequence of 79 updates with 50 flows each, which we send to the switch while measuring compute time. We measure a mean update processing time of 3.4 ms with 0.5 ms standard deviation. This indicates that our approach obtains satisfactory performance despite the low-end switch CPU. Moreover, the observed CPU utilization for handling messages is negligible.
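The timing harness is straightforward; a minimal version is sketched below, where the update sequence and the `compute_schedule` entry point are stand-ins for the local controller's scheduling routine (names are ours).

```python
import statistics
import time

def benchmark(updates, compute_schedule):
    """Time the local scheduling computation for each update (here, 79 updates
    of ~50 flows each) and report mean and standard deviation in ms."""
    samples_ms = []
    for update in updates:
        start = time.perf_counter()
        compute_schedule(update)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)
```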

As a further thought on optimizing ez-Segway, we asked ourselves whether current switches are suitable for moving the scheduling computation from the centralized controller to the switches themselves. This would be a natural and intuitive optimization, since our scheduling algorithm consists of an iterative computation of the update operation priorities for each switch. It would be beneficial to the system when the number of switches in the network is much larger than the number of cores in the central controller. We found that processing the dependency graph for the same sequence of 79 updates of the Internet2 topology, with ∼200 flows in the entire network, takes on average 22 ms with 6.25 ms standard deviation with our heuristic.

P4. To further illustrate the feasibility of realizing simple network update functionality in switches, we explored implementing ez-Segway in P4 [1]. We regard this as a thought experiment that finds a positive answer. However, the need to delegate update functions to the switch data plane is unclear, since the main performance bottleneck in our results is due to coordination.

Using P4, we are able to express basic ez-Segway functionality as a set of handlers for InstallUpdate, GoodToMove, and Removing messages. The InstallUpdate handler receives a new flow update message and updates the switch state. If the switch is the last switch on the new path, it performs the flow update and sends

a GoodToMove message to its predecessor. Upon receiving a GoodToMove message, a switch checks whether the corresponding update operation is allowed by ensuring that the available capacity on the new path is higher than the flow volume. In that case, the switch performs the update operation, updates the available capacity, and forwards the GoodToMove message to its predecessor (until the first switch is reached). When the lack of available capacity prevents an update operation, the switch simply updates some bookkeeping state to be used later. Once a GoodToMove message is received, the switch also attempts to perform any update operation that was pending due to unavailable capacity. Upon receiving a Removing message, the switch simply uninstalls the flow state and propagates the message to the downstream switch until the last one. For the sake of presentation, the above does not describe priority handling.
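To make the interplay of the handlers concrete, here is a minimal sketch of this logic in Python rather than P4, for readability. The Flow record, its field names, and the per-link capacity bookkeeping are illustrative assumptions of ours, not the actual ez-Segway message or state layout, and priority handling is omitted as noted above.

```python
from collections import namedtuple

# Hypothetical per-flow record carried in ez-Segway update messages (our naming).
Flow = namedtuple(
    "Flow",
    "volume is_last_switch new_next_hop_link old_next_hop_link predecessor successor",
)

class EzSegwaySwitch:
    """Python rendering (not P4) of the three message handlers described above."""

    def __init__(self, send, free_capacity):
        self.send = send                    # send(neighbor, msg_type, flow)
        self.free_capacity = free_capacity  # spare capacity per outgoing link
        self.pending = []                   # updates blocked on capacity

    def on_install_update(self, flow):
        # The last switch on the new path applies the update immediately and
        # starts the upstream chain of GoodToMove messages.
        if flow.is_last_switch:
            self._apply(flow)

    def on_good_to_move(self, flow):
        link = flow.new_next_hop_link
        if flow.volume <= self.free_capacity.get(link, 0):
            self._apply(flow)
        else:
            self.pending.append(flow)       # park until capacity is released
        self._retry_pending()

    def on_removing(self, flow):
        # Uninstall old-path state, release its capacity, and pass the message
        # downstream until the last switch.
        old = flow.old_next_hop_link
        self.free_capacity[old] = self.free_capacity.get(old, 0) + flow.volume
        if flow.successor is not None:
            self.send(flow.successor, "Removing", flow)
        self._retry_pending()

    def _apply(self, flow):
        if not flow.is_last_switch:
            self.free_capacity[flow.new_next_hop_link] -= flow.volume
        # ... install the new forwarding entry for `flow` here ...
        if flow.predecessor is not None:
            self.send(flow.predecessor, "GoodToMove", flow)

    def _retry_pending(self):
        still_blocked = []
        for f in self.pending:
            if f.volume <= self.free_capacity.get(f.new_next_hop_link, 0):
                self._apply(f)
            else:
                still_blocked.append(f)
        self.pending = still_blocked
```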

While the above functionality is relatively simple, some characteristics of the P4 language make it non-trivial to express. P4 does not have iteration, recursion, or queue data structures. Moreover, P4 switches cannot craft new messages; they can only modify fields in the header of the packet that they are currently processing [5]. To overcome these limitations of the language, we perform loop unrolling of the ez-Segway logic and adopt a header format that contains the union of all fields in all ez-Segway messages. More details on our P4 implementation appear in the extended version of this paper [26].
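To illustrate the "union of all fields" idea, the following sketch encodes a single fixed header carrying a message-type discriminator plus every field any handler may need. The field names and widths are illustrative assumptions of ours; the actual wire format is described in the extended version [26].

```python
import struct

# One header layout shared by all ez-Segway message types (illustrative only).
MSG_INSTALL_UPDATE, MSG_GOOD_TO_MOVE, MSG_REMOVING = 1, 2, 3
HEADER = struct.Struct("!B I I I Q")  # msg_type, flow_id, old_path_id, new_path_id, volume

def pack(msg_type, flow_id, old_path_id, new_path_id, volume):
    return HEADER.pack(msg_type, flow_id, old_path_id, new_path_id, volume)

def unpack(buf):
    msg_type, flow_id, old_path_id, new_path_id, volume = HEADER.unpack(buf)
    return {"msg_type": msg_type, "flow_id": flow_id,
            "old_path_id": old_path_id, "new_path_id": new_path_id,
            "volume": volume}
```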

9. RELATED WORK

The network update problem has been widely studied in the recent literature [2, 3, 14, 16, 18, 22–24, 27, 28, 32, 34]. These works use a centralized architecture in which an SDN controller computes the sequence of update operations and actively coordinates the switches to perform these operations. The work in [23] relies on clock synchronization for performing fast updates, which are, however, non-consistent due to synchronization imprecisions and non-deterministic delays in installing forwarding rules [9, 14]. In contrast, ez-Segway speeds up network updates by delegating the task of coordinating the network update to the switches. To the best of our knowledge, ez-Segway is the first work that tackles the network update problem in a decentralized manner.

In a sense, our approach follows the line of recent proposals to centrally control distributed networks, like Fibbing [33], where the benefits of centralized route computation are combined with the reactiveness of distributed approaches. We discuss below the most relevant work with respect to our decentralized mechanism.

Dionysus [14] is a centralized scheduling algorithm that updates flows atomically (i.e., with no traffic splitting). Dionysus computes a graph to encode dependencies among update operations. This dependency graph is used by the controller to perform update operations based on dynamic conditions of the switches. While

ez-Segway also relies on a similar dependency graph, the controller is only responsible for constructing the graph, whereas the switches use it to schedule update operations in a decentralized fashion.

Recent work [34] independently introduced flow segmentation and traffic splitting techniques similar to ours [25]. However, their work and problem formulation focus on minimizing the risk of incurring a deadlock in the centralized setting. In contrast, ez-Segway develops flow segmentation to reduce network update completion time.

From an algorithmic perspective, [14] and [2] showed that, without the ability to split a flow, several formulations of the network update problem are NP-hard. The work in [2] is the only one that provides a polynomial-time algorithm that is correct and complete for the network update problem. However, their model allows flows to be moved onto any possible intermediate path (and not only the initial and final ones). Moreover, this approach has several limitations. First, the complexity of the algorithm is too high for practical usage; consequently, the authors do not evaluate their algorithm. Second, this work also assumes a centralized controller that schedules the updates.

10. CONCLUSION

This paper explored delegating the responsibility of executing consistent updates to the switches. We proposed ez-Segway, a mechanism that achieves faster completion times of network updates by moving simple, yet clever, coordination operations into the switches, while the more expensive computations are performed at the centralized controller. Our approach enables switches to engage as active actors in realizing updates that provably satisfy three properties: black-hole freedom, loop freedom, and congestion freedom.

In practice, our approach leads to improved update times, which we quantified via emulation and simulation on a range of network topologies and traffic patterns. Our results show that ez-Segway improves network update times by up to 45% and 57% at the median and the 99th percentile, respectively. We also deployed our approach on a real OpenFlow switch to demonstrate its feasibility and low computational overhead, and implemented the switch functionality in P4.

Acknowledgements. We would like to thank the anonymous reviewers and our shepherd Jia Wang for their feedback. We are thankful to Xiao Chen, Paolo Costa, Huynh Tu Dang, Petr Kuznetsov, Ratul Mahajan, Jennifer Rexford, Robert Soule, and Stefano Vissicchio for their helpful comments on earlier drafts of this paper. This research is (in part) supported by the European Union's Horizon 2020 research and innovation programme under the ENDEAVOUR project (grant agreement 644960).

11. REFERENCES

[1] P. Bosshart, D. Daly, G. Gibb, et al. P4: Programming Protocol-Independent Packet Processors. SIGCOMM Comput. Commun. Rev., 44(3), July 2014.

[2] S. Brandt, K.-T. Forster, and R. Wattenhofer. On Consistent Migration of Flows in SDNs. In INFOCOM, 2016.

[3] M. Canini, P. Kuznetsov, D. Levin, and S. Schmid. A Distributed and Robust SDN Control Plane for Transactional Network Updates. In INFOCOM, 2015.

[4] Centec Networks. GoldenGate 580 Series. http://bit.ly/2nbtHPA.

[5] H. T. Dang, M. Canini, F. Pedone, and R. Soule. Paxos Made Switch-y. SIGCOMM Comput. Commun. Rev., 46(2), Apr. 2016.

[6] K. Foerster, S. Schmid, and S. Vissicchio. Survey of Consistent Network Updates. CoRR, abs/1609.02305, 2016.

[7] M. Gerola, F. Lucrezia, M. Santuari, E. Salvadori, P. L. Ventre, S. Salsano, and M. Campanella. ICONA: a Peer-to-Peer Approach for Software Defined Wide Area Networks using ONOS. In EWSDN, 2016.

[8] J. He, M. Suchara, M. Bresler, J. Rexford, and M. Chiang. Rethinking Internet Traffic Management: From Multiple Decompositions to a Practical Protocol. In CoNEXT, 2007.

[9] K. He, J. Khalid, A. Gember-Jacobson, S. Das, C. Prakash, A. Akella, L. E. Li, and M. Thottan. Measuring Control Plane Latency in SDN-Enabled Switches. In SOSR, 2015.

[10] B. Heller, N. Handigol, V. Jeyakumar, B. Lantz, and N. McKeown. Reproducible Network Experiments using Container Based Emulation. In CoNEXT, 2012.

[11] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer. Achieving High Utilization with Software-Driven WAN. In SIGCOMM, 2013.

[12] Internet2. Internet2 Network Infrastructure Topology. http://bit.ly/2nbrh3k.

[13] S. Jain, A. Kumar, S. Mandal, et al. B4: Experience with a Globally-Deployed Software Defined WAN. In SIGCOMM, 2013.

[14] X. Jin, H. H. Liu, R. Gandhi, S. Kandula, R. Mahajan, M. Zhang, J. Rexford, and R. Wattenhofer. Dynamic Scheduling of Network Updates. In SIGCOMM, 2014.

[15] N. Kang, M. Ghobadi, J. Reumann, A. Shraer, and J. Rexford. Efficient Traffic Splitting on Commodity Switches. In CoNEXT, 2015.

[16] N. P. Katta, J. Rexford, and D. Walker. Incremental Consistent Updates. In HotSDN, 2013.

[17] D. Levin, M. Canini, S. Schmid, F. Schaffert, and A. Feldmann. Panopticon: Reaping the Benefits of Incremental SDN Deployment in Enterprise Networks. In USENIX ATC, 2014.

[18] H. H. Liu, X. Wu, M. Zhang, L. Yuan, R. Wattenhofer, and D. Maltz. zUpdate: Updating Data Center Networks with Zero Loss. In SIGCOMM, 2013.

[19] A. Ludwig, J. Marcinkowski, and S. Schmid. Scheduling Loop-free Network Updates: It's Good to Relax! In PODC, 2015.

[20] R. Mahajan, N. T. Spring, D. Wetherall, and T. E. Anderson. Inferring Link Weights using End-to-End Measurements. In IMW, 2002.

[21] R. Mahajan and R. Wattenhofer. On Consistent Updates in Software Defined Networks. In HotNets, 2013.

[22] J. McClurg, H. Hojjat, P. Cerny, and N. Foster. Efficient Synthesis of Network Updates. In PLDI, 2015.

[23] T. Mizrahi and Y. Moses. Software Defined Networks: It's About Time. In INFOCOM, 2016.

[24] T. Mizrahi, O. Rottenstreich, and Y. Moses. TimeFlip: Scheduling Network Updates with Timestamp-based TCAM Ranges. In INFOCOM, 2015.

[25] T. D. Nguyen, M. Chiesa, and M. Canini. Towards Decentralized Fast Consistent Updates. In ANRW, 2016.

[26] T. D. Nguyen, M. Chiesa, and M. Canini. Decentralized Consistent Network Updates in SDN with ez-Segway. CoRR, abs/1703.02149, 2017.

[27] P. Peresini, M. Kuzniar, M. Canini, and D. Kostic. ESPRES: Transparent SDN Update Scheduling. In HotSDN, 2014.

[28] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker. Abstractions for Network Update. In SIGCOMM, 2012.

[29] RFC 3209. RSVP-TE: Extensions to RSVP for LSP Tunnels.

[30] M. Roughan. Simplifying the Synthesis of Internet Traffic Matrices. SIGCOMM Comput. Commun. Rev., 35(5), Oct. 2005.

[31] Ryu SDN Framework. http://osrg.github.io/ryu/.

[32] S. Vissicchio and L. Cittadini. FLIP the (Flow) Table: Fast LIghtweight Policy-preserving SDN Updates. In INFOCOM, 2016.

[33] S. Vissicchio, O. Tilmans, L. Vanbever, and J. Rexford. Central Control Over Distributed Routing. In SIGCOMM, 2015.

[34] W. Wang, W. He, J. Su, and Y. Chen. Cupid: Congestion-free Consistent Data Plane Update In Software Defined Networks. In INFOCOM, 2016.