
Achieving High Utilization with Software-Driven WAN

Chi-Yao Hong (UIUC) Srikanth Kandula Ratul Mahajan Ming Zhang Vijay Gill Mohan Nanduri Roger Wattenhofer (ETH)

Microsoft

Abstract— We present SWAN, a system that boosts the utilization of inter-datacenter networks by centrally controlling when and how much traffic each service sends and frequently re-configuring the network's data plane to match current traffic demand. But done simplistically, these re-configurations can also cause severe, transient congestion because different switches may apply updates at different times. We develop a novel technique that leverages a small amount of scratch capacity on links to apply updates in a provably congestion-free manner, without making any assumptions about the order and timing of updates at individual switches. Further, to scale to large networks in the face of limited forwarding table capacity, SWAN greedily selects a small set of entries that can best satisfy current demand. It updates this set without disrupting traffic by leveraging a small amount of scratch capacity in forwarding tables. Experiments using a testbed prototype and data-driven simulations of two production networks show that SWAN carries 60% more traffic than the current practice.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design

Keywords: Inter-DC WAN; software-defined networking

1. INTRODUCTION

The wide area network (WAN) that connects the datacenters (DC) is critical infrastructure for providers of online services such as Amazon, Google, and Microsoft. Many services rely on low-latency inter-DC communication for good user experience and on high-throughput transfers for reliability (e.g., when replicating updates). Given the need for high capacity (inter-DC traffic is a significant fraction of Internet traffic and rapidly growing [20]) and unique traffic characteristics, the inter-DC WAN is often a dedicated network, distinct from the WAN that connects with ISPs to reach end users [15]. It is an expensive resource, with amortized annual cost of 100s of millions of dollars, as it provides 100s of Gbps to Tbps of capacity over long distances.

However, providers are unable to fully leverage this investment today. Inter-DC WANs have extremely poor efficiency; the average utilization of even the busier links is 40-60%. One culprit is the lack of coordination among the services that use the network. Barring coarse, static limits


in some cases, services send traffic whenever they want and however much they want. As a result, the network cycles through periods of peaks and troughs. Since it must be provisioned for peak usage to avoid congestion, the network is under-subscribed on average. Observe that network usage does not have to be this way if we can exploit the characteristics of inter-DC traffic. Some inter-DC services are delay-tolerant. We can tamp the cyclical behavior if such traffic is sent when the demand from other traffic is low. This coordination will boost average utilization and enable the network to either carry more traffic with the same capacity or use less capacity to carry the same traffic.[1]

Another culprit behind poor efficiency is the distributed resource allocation model of today, typically implemented using MPLS TE (Multiprotocol Label Switching Traffic Engineering) [4, 24]. In this model, no entity has a global view and ingress routers greedily select paths for their traffic. As a result, the network can get stuck in locally optimal routing patterns that are globally suboptimal [27].

We present SWAN (Software-driven WAN), a system that enables inter-DC WANs to carry significantly more traffic. By itself, carrying more traffic is straightforward: we can let loose bandwidth-hungry services. SWAN achieves high efficiency while meeting policy goals such as preferential treatment for higher-priority services and fairness among similar services. Per observations above, its two key aspects are i) globally coordinating the sending rates of services; and ii) centrally allocating network paths. Based on current service demands and network topology, SWAN decides how much traffic each service can send and configures the network's data plane to carry that traffic.

Maintaining high utilization requires frequent updates to the network's data plane, as traffic demand or network topology changes. A key challenge is to implement these updates without causing transient congestion that can hurt latency-sensitive traffic. The underlying problem is that the updates are not atomic as they require changes to multiple switches. Even if the before and after states are not congested, congestion can occur during updates if traffic that a link is supposed to carry after the update arrives before the traffic that is supposed to leave has left. The extent and duration of such congestion is worse when the network is busier and has larger RTTs (which lead to greater temporal disparity in the application of updates). Both these conditions hold

[1] In some networks, fault tolerance is another reason for low utilization; the network is provisioned such that there is ample capacity even after (common) failures. However, in inter-DC WANs, traffic that needs strong protection is a small subset of the overall traffic, and existing technologies can tag and protect such traffic in the face of failures (§2).


for our setting, and we find that uncoordinated updates lead to severe congestion and heavy packet loss.

This challenge recurs in every centralized resource allocation scheme. MPLS TE's distributed resource allocation can make only a smaller class of "safe" changes; it cannot make coordinated changes that require one flow to move in order to free a link for use by another flow. Further, recent work on atomic updates, to ensure that no packet experiences a mix of old and new forwarding rules [23, 29], does not address our challenge. It does not consider capacity limits and treats each flow independently; congestion can still occur due to uncoordinated flow movements.

We address this challenge by first observing that it is impossible to update the network's data plane without creating congestion if all links are full. SWAN thus leaves "scratch" capacity s (e.g., 10%) at each link. We prove that this enables a congestion-free plan to update the network in at most ⌈1/s⌉−1 steps. Each step involves a set of changes to forwarding rules at switches, with the property that there will be no congestion independent of the order and timing of those changes. We then develop an algorithm to find a congestion-free plan with the minimum number of steps. Further, SWAN does not waste the scratch capacity. Some inter-DC traffic is tolerant to small amounts of congestion (e.g., data replication with long deadlines). We extend our basic approach to use all link capacity while guaranteeing bounded-congestion updates for tolerant traffic and congestion-free updates for the rest.

Another challenge that we face is that fully using network capacity requires many forwarding rules at switches, to exploit many alternative paths through the network, but commodity switches support a limited number of forwarding rules.[2] Analysis of a production inter-DC WAN shows that the number of rules required to fully use its capacity exceeds the limits of even next generation SDN switches. We address this challenge by dynamically changing, based on traffic demand, the set of paths available in the network. On the same WAN, our technique can fully use network capacity with an order of magnitude fewer rules.

We develop a prototype of SWAN and evaluate our approach through testbed experiments and simulations using traffic and topology data from two production inter-DC WANs. We find that SWAN carries 60% more traffic than MPLS TE and comes within 2% of the traffic carried by an optimal method that assumes infinite rule capacity and incurs no update overhead. We also show that network updates complete quickly, requiring only 1-3 steps.

While our work focuses on inter-DC WANs, many of its underlying techniques are useful for other WANs as well (e.g., ISP networks). We show that even without controlling how much traffic enters the network, an ability that is unique to the inter-DC context, our techniques for global resource and change management allow the network to carry 16-25% more traffic than MPLS TE.

2. BACKGROUND AND MOTIVATION

Inter-DC WANs carry traffic from a range of services, where a service is an activity across multiple hosts. Externally visible functionality is usually enabled by multiple internal services (e.g., search may use Web-crawler, index-builder, and query-responder services). Prior work [6] and our conversations with operators reveal that services fall into three broad types, based on their performance requirements.

[2] The limit stems from the amount of fast, expensive memory in switches. It is not unique to OpenFlow switches; the number of tunnels that MPLS routers support is also limited [2].

Interactive services are in the critical path of end user experience. An example is when one DC contacts another in the process of responding to a user request because not all information is available in the first DC. Interactive traffic is highly sensitive to loss and delay; even small increases in response time (100 ms) degrade user experience [35].

Elastic services are not in the critical path of user experience but still require timely delivery. An example is replicating a data update to another DC. Elastic traffic requires delivery within a few seconds or minutes. The consequences of delay vary with the service. In the replication example, the risk is loss of data if a failure occurs or that a user will observe data inconsistency.

Background services conduct maintenance and provisioning activities. An example is copying all the data of a service to another DC for long-term storage or as a precursor to running the service there. Such traffic tends to be bandwidth hungry. While it has no explicit deadline or a long deadline, it is still desirable to complete transfers as soon as possible; delays lower business agility and tie up expensive server resources.

In terms of overall volumes, interactive traffic is the smallest subset and background traffic is the largest.

2.1 Current traffic engineering practice

Many WANs are operated using MPLS TE today. To effectively use network capacity, MPLS TE spreads traffic across a number of tunnels between ingress-egress router pairs. Ingress routers split traffic, typically equally using equal cost multipath routing (ECMP), across the tunnels to the same egress. They also estimate the traffic demand for each tunnel and find network paths for it using the constrained shortest path first (CSPF) algorithm, which identifies the shortest path that can accommodate the tunnel's traffic (subject to priorities; see below).
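To make the CSPF step concrete, the sketch below (a minimal illustration, not a vendor implementation; the topology and numbers are hypothetical) prunes links whose residual capacity cannot fit the tunnel's demand and then runs an ordinary shortest-path search on what remains.

import heapq

def cspf(graph, residual, src, dst, demand):
    """Constrained shortest path first (sketch): Dijkstra over links whose
    residual capacity can accommodate `demand`.
    graph: {node: {neighbor: latency}}; residual: {(u, v): capacity left}."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, lat in graph.get(u, {}).items():
            if residual.get((u, v), 0) < demand:
                continue  # link cannot fit this tunnel; prune it
            nd = d + lat
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in prev and dst != src:
        return None  # no feasible path
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

# Toy example: A-B-D is shorter but lacks capacity for an 8-unit tunnel.
graph = {"A": {"B": 1, "C": 2}, "B": {"D": 1}, "C": {"D": 2}}
residual = {("A", "B"): 5, ("B", "D"): 5, ("A", "C"): 10, ("C", "D"): 10}
print(cspf(graph, residual, "A", "D", demand=8))  # -> ['A', 'C', 'D']

Real MPLS TE implementations additionally handle tunnel priorities and preemption, which this sketch omits.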

With MPLS TE, service differentiation can be provided using two mechanisms. First, tunnels are assigned priorities and different types of services are mapped to different tunnels. Higher priority tunnels can displace lower priority tunnels and thus obtain shorter paths; the ingress routers of displaced tunnels must then find new paths. Second, packets carry differentiated services code point (DSCP) bits in the IP header. Switches map different bits to different priority queues, which ensures that packets are not delayed or dropped due to lower-priority traffic; they may still be delayed or dropped due to equal or higher priority traffic. Switches typically have only a few priority queues (4-8).

2.2 Problems of MPLS TE

Inter-DC WANs suffer from two key problems today.

Poor efficiency: The amount of traffic the WAN carries tends to be low compared to capacity. For a production inter-DC WAN, which we call IDN (§6.1), we find that the average utilization of half the links is under 30% and of three in four links is under 50%.

Two factors lead to poor efficiency. First, services send whenever and however much traffic they want, without regard to the current state of the network or other services. This lack of coordination leads to the network swinging between


Figure 1: Illustration of poor utilization. (a) Daily traffic pattern on a busy link in a production inter-DC WAN. (b) Breakdown based on traffic type. (c) Reduction in peak usage if background traffic is dynamically adapted.

Figure 2: Inefficient routing due to local allocation. (a) Local path selection. (b) Globally optimal paths.

over- and under-subscription. Figure 1a shows the load over a day on a busy link in IDN. Assuming capacity matches peak usage (a common provisioning model to avoid congestion), the average utilization on this link is under 50%. Thus, half the provisioned capacity is wasted. This inefficiency is not fundamental but can be remedied by exploiting traffic characteristics. As a simple illustration, Figure 1b separates background traffic. Figure 1c shows that the same total traffic can fit in half the capacity if background traffic is adapted to use capacity left unused by other traffic.

Second, the local, greedy resource allocation model of MPLS TE is inefficient. Consider Figure 2 in which each link can carry at most one flow. If the flows arrive in the order FA, FB, and FC, Figure 2a shows the path assignment with MPLS TE: FA is assigned to the top path, which is one of the shortest paths; when FB arrives, it is assigned to the shortest path with available capacity (CSPF); and the same happens with FC. Figure 2b shows a more efficient routing pattern with shorter paths and many links freed up to carry more traffic. Such an allocation requires non-local changes, e.g., moving FA to the lower path when FB arrives.

Partial solutions for such inefficiency exist. Flows can be split across two tunnels, which would divide FA across the top and bottom paths, allowing half of FB and FC to use direct paths; a preemption strategy that prefers shorter paths can also help. But such strategies do not address the fundamental problem of local allocation decisions [27].

Poor sharing: Inter-DC WANs have limited support for flexible resource allocation. For instance, it is difficult to be fair across services or favor some services over certain paths. When services compete today, they typically obtain throughput proportional to their sending rate, an undesirable outcome (e.g., it creates perverse incentives for service developers). Mapping each service onto its own queue at routers can alleviate problems but the number of services (100s) far exceeds the number of available router queues. Even if we had infinite queues and could ensure fairness on the data plane, network-wide fairness is not possible without

Figure 3: Link-level fairness ≠ network-wide fairness. (a) Link-level. (b) Network-wide.

controlling which flows have access to which paths. Consider Figure 3 in which each link has unit capacity and each service (Si→Di) has unit demand. With link-level fairness, S2→D2 gets twice the throughput of the other services. As we show, flexible sharing can be implemented with a limited number of queues by carefully allocating paths to traffic and controlling the sending rates of services.

3. SWAN OVERVIEW AND CHALLENGES

Our goal is to carry more traffic and support flexible network-wide sharing. Driven by inter-DC traffic characteristics, SWAN supports two types of sharing policies. First, it supports a small number of priority classes (e.g., Interactive > Elastic > Background) and allocates bandwidth in strict precedence across these classes, while preferring shorter paths for higher classes. Second, within a class, SWAN allocates bandwidth in a max-min fair manner.

SWAN has two basic components that address the fundamental shortcomings of the current practice. It coordinates the network activity of services and uses centralized resource allocation. Abstractly, it works as follows:

1. All services, except interactive ones, inform the SWAN controller of their demand between pairs of DCs. Interactive traffic is sent as it is today, without permission from the controller, so there is no delay.

2. The controller, which has an up-to-date, global view of the network topology and traffic demands, computes how much each service can send and the network paths that can accommodate the traffic.

3. Per the SDN paradigm, the controller directly updates the forwarding state of the switches. We use OpenFlow switches, though any switch that permits direct programming of forwarding state (e.g., MPLS Explicit Route Objects [3]) may be used.

While the architecture is conceptually simple, we must address three challenges to realize this design. First, we need a scalable algorithm for global allocation that maximizes network utilization subject to constraints on service priority and fairness. The best known solutions are computationally intensive as they solve long sequences of linear programs (LPs) [9, 26]. Instead, SWAN uses a more practical approach that is approximately fair with provable bounds and close to optimal in practical scenarios (§6).

Second, atomic reconfiguration of a distributed system of switches is hard to engineer. Network forwarding state needs updating in response to changes in the traffic demand or network topology. Lacking WAN-wide atomic changes, the network can drop many packets due to transient congestion even if both the initial and final configurations are uncongested. Consider Figure 4 in which each flow is 1 unit and each link's capacity is 1.5 units. Suppose we want to change the network's forwarding state from Figure 4a to 4b, perhaps to accommodate a new flow from R2 to R4. This change requires changes to at least two switches. Depending on the


Figure 4: Illustration of congestion-free updates. Each flow's size is 1 unit and each link's capacity is 1.5 units. Changing from state (a) to (b) may lead to congested states (c) or (d). A congestion-free update sequence is (a);(e);(f);(b).

order in which the switch-level changes occur, the network reaches the states in Figures 4c or 4d, which have a heavily congested link and can significantly hurt TCP flows as many packets may be lost in a burst.

To avoid congestion during network updates, SWAN computes a multi-step congestion-free transition plan. Each step involves one or more changes to the state of one or more switches, but irrespective of the order in which the changes are applied, there will be no congestion. For the reconfiguration in Figure 4, a possible congestion-free plan is: i) move half of FA to the lower path (Figure 4e); ii) move FB to the upper path (Figure 4f); and iii) move the remaining half of FA to the lower path (Figure 4b).
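The property being exploited can be checked mechanically: between two consecutive configurations, a link's worst-case load is the sum, over flows, of the larger of the flow's old and new contribution to that link, since all increases may land before any decreases. A small sketch of such a check follows; the two-link topology and flow placements are hypothetical, loosely modeled on Figure 4.

# Each configuration maps flow -> {link: traffic placed on that link}.
def worst_case_load(cfg_a, cfg_b):
    """Worst-case per-link load while moving from cfg_a to cfg_b, assuming
    switch-level updates can land in any order."""
    load = {}
    for f in set(cfg_a) | set(cfg_b):
        before, after = cfg_a.get(f, {}), cfg_b.get(f, {})
        for link in set(before) | set(after):
            load[link] = load.get(link, 0.0) + max(before.get(link, 0.0),
                                                   after.get(link, 0.0))
    return load

def plan_is_congestion_free(plan, capacity):
    """True if no link exceeds its capacity in any transition of the plan."""
    for cfg_a, cfg_b in zip(plan, plan[1:]):
        for link, load in worst_case_load(cfg_a, cfg_b).items():
            if load > capacity[link] + 1e-9:
                return False
    return True

# Hypothetical two-link example: link capacity 1.5 units, flows of size 1.
capacity = {"top": 1.5, "bottom": 1.5}
start = {"FA": {"top": 1.0}, "FB": {"bottom": 1.0}}
step1 = {"FA": {"top": 0.5, "bottom": 0.5}, "FB": {"bottom": 1.0}}
one_shot = {"FA": {"bottom": 1.0}, "FB": {"bottom": 1.0}}
print(plan_is_congestion_free([start, one_shot], capacity))  # False: 2.0 units on "bottom"
print(plan_is_congestion_free([start, step1], capacity))     # True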

A congestion-free plan may not always exist, and even if it does, it may be hard to find or involve a large number of steps. SWAN leaves scratch capacity of s ∈ [0, 50%] on each link, which guarantees that a transition plan exists with at most ⌈1/s⌉−1 steps (which is 9 if s=10%). We then develop a method to find a plan with the minimal number of steps. In practice, it finds a plan with 1-3 steps when s=10%.

Further, instead of wasting scratch capacity, SWAN allocates it to background traffic. Overall, it guarantees that non-background traffic experiences no congestion during transitions, and the congestion for background traffic is bounded (configurable).

Third, switch hardware supports a limited number of forwarding rules, which makes it hard to fully use network capacity. For instance, if a switch has six distinct paths to a destination but supports only four rules, a third of the paths cannot be used. Our analysis of a production inter-DC WAN illustrates the challenge. If we use k-shortest paths between each pair of switches (as in MPLS), fully using this network's capacity requires k=15. Installing this many tunnels needs up to 20K rules at switches (§6.5), which is beyond the capabilities of even next-generation SDN switches; the Broadcom Trident2 chipset will support 16K OpenFlow rules [33]. The current-generation switches in our testbed support 750 rules.

To fully exploit network capacity with a limited number of rules, we are motivated by how the working set of a process is often a lot smaller than the total memory it uses. Similarly, not all tunnels are needed at all times. Instead, as traffic demand changes, different sets of tunnels are most suitable. SWAN dynamically identifies and installs these tunnels. Our dynamic tunnel allocation method, which uses an LP, is effective because the number of non-zero variables in a basic solution for any LP is fewer than the number of constraints [25]. In our case, we will see that the variables include the fraction of a DC-pair's traffic that is carried over a

Figure 5: Architecture of SWAN.

tunnel and the number of constraints is roughly the number of priority classes times the number of DC pairs. Because SWAN supports three priority classes, we obtain three tunnels with non-zero traffic per DC pair on average, which is much less than the 15 required for a non-dynamic solution.

Dynamically changing rules introduces another wrinkle for network reconfiguration. To not disrupt traffic, new rules must be added before the old rules are deleted; otherwise, the traffic that is using the to-be-deleted rules will be disrupted. Doing so requires some rule capacity to be kept vacant at switches to accommodate the new rules; done simplistically, up to half of the rule capacity must be kept vacant [29], which is wasteful. SWAN sets aside a small amount of scratch space (e.g., 10%) and uses a multi-stage approach to change the set of rules in the network.

4. SWAN DESIGN

Figure 5 shows the architecture of SWAN. A logically centralized controller orchestrates all activity. Each non-interactive service has a broker that aggregates demands from the hosts and apportions the allocated rate to them. One or more network agents intermediate between the controller and the switches. This architecture provides scale (by providing parallelism where needed) and choice (each service can implement a rate allocation strategy that fits).

Service hosts and brokers collectively estimate the service's current demand and limit it to the rate allocated by the controller. Our current implementation draws on distributed rate limiting [5]. A shim in the host OS estimates its demand to each remote DC for the next Th=10 seconds and asks the broker for an allocation. It uses a token bucket per remote DC to enforce the allocated rate and tags packets with DSCP bits to indicate the service's priority class.
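The shim's enforcement can be approximated with a per-remote-DC token bucket; a minimal sketch (the class name and parameters are illustrative, not SWAN's actual implementation):

import time

class TokenBucket:
    """Simple token bucket: admit a packet of `size_bytes` if enough tokens
    have accumulated at `rate_bps` since the last check."""
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # bytes per second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, size_bytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False

# One bucket per (service, remote DC); the rate comes from the broker's allocation.
bucket = TokenBucket(rate_bps=100e6, burst_bytes=1_500_000)
if bucket.allow(1500):
    pass  # hand the packet to the NIC; otherwise queue or drop it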

The service broker aggregates demand from hosts and updates the controller every Ts=5 minutes. It apportions its allocation from the controller piecemeal, in time units of Th, to hosts in a proportionally fair manner. This way, Th is the maximum time that a newly arriving host has to wait before starting to transmit. It is also the maximum time a service takes to change its sending rate to a new allocation. Brokers that suddenly experience radically larger demands can ask for more at any time; the controller does a lightweight computation to determine how much of the additional demand can be carried without altering the network configuration.

Network agents track topology and traffic with the aid of switches. They relay news about topology changes to the controller right away, and collect and report information about traffic, at the granularity of OpenFlow rules, every Ta=5 minutes. They are also responsible for reliably updating switch rules as requested by the controller. Before returning success, an agent reads the relevant part of the switch rule table to ensure that the changes have been successfully applied.


The controller uses the information on service demands and network topology to do the following every Tc=5 minutes.

1. Compute the service allocations and forwarding plane configuration for the network (§4.1, §4.2).

2. Signal new allocations to services whose allocation has decreased. Wait for Th seconds for the service to lower its sending rate.

3. Change the forwarding state (§4.3) and then signal the new allocations to services whose allocation has increased.

4.1 Forwarding plane configuration

SWAN uses label-based forwarding. Doing so reduces forwarding complexity; the complex classification that may be required to assign a label to traffic is done just once, at the source switch. Remaining switches simply read the label and forward the packet based on the rules for that label. We use VLAN IDs as labels.

Ingress switches split traffic across multiple tunnels (labels). We propose to implement unequal splitting, which leads to more efficient allocation [13], using group tables in the OpenFlow pipeline. The first table maps the packet, based on its destination and other characteristics (e.g., DSCP bits), to a group table. Each group table consists of the set of tunnels available and a weight assignment that reflects the ratio of traffic to be sent to each tunnel. Conversations with switch vendors indicate that most will roll out support for unequal splitting. When such support is unavailable, SWAN uses traffic profiles to pick boundaries in the range of IP addresses belonging to a DC such that splitting traffic to that DC at these boundaries will lead to the desired split. Then, SWAN configures rules at the source switch to map IP destination spaces to tunnels. Our experiments with traffic from a production WAN show that implementing unequal splits in this way leads to a small amount of error (less than 2%).
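One way to realize the IP-boundary fallback described above is to walk a traffic profile over a DC's destination address space and cut it where the cumulative share crosses each tunnel's desired weight. A hedged sketch; the profile format and prefix granularity are assumptions:

def pick_boundaries(profile, weights):
    """profile: list of (ip_prefix, traffic_share) sorted by address, with
    shares summing to 1. weights: desired per-tunnel split, summing to 1.
    Returns, per tunnel, the list of prefixes mapped to it."""
    targets, acc = [], 0.0
    for w in weights[:-1]:
        acc += w
        targets.append(acc)          # cumulative cut points
    groups = [[] for _ in weights]
    cum, t = 0.0, 0
    for prefix, share in profile:
        if t < len(targets) and cum >= targets[t]:
            t += 1                   # crossed a boundary; start the next tunnel
        groups[t].append(prefix)
        cum += share
    return groups

# Hypothetical /24-level profile for one destination DC, split 50/50.
profile = [("10.0.0.0/24", 0.30), ("10.0.1.0/24", 0.20),
           ("10.0.2.0/24", 0.25), ("10.0.3.0/24", 0.25)]
print(pick_boundaries(profile, weights=[0.5, 0.5]))
# -> [['10.0.0.0/24', '10.0.1.0/24'], ['10.0.2.0/24', '10.0.3.0/24']]

Finer-grained prefixes in the profile would reduce the splitting error.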

4.2 Computing service allocations

When computing allocated rates for services, our goal is to maximize network utilization subject to service priorities and approximate max-min fairness among same-priority services. The allocation process must be scalable enough to handle WANs with 100s of switches.

Inputs: The allocation uses as input the service demands di between pairs of DCs. While brokers report the demand for non-interactive services, SWAN estimates the demand of interactive services (see below). We also use as input the paths (tunnels) available between a DC pair. Running an unconstrained multi-commodity problem could result in allocations that require many rules at switches. Since a DC pair's traffic could flow through any link, every switch may need rules to split every pair's traffic across its outgoing ports. Constraining the usable paths avoids this possibility and also simplifies data plane updates (§4.3). But it may lead to lower overall throughput. For our two production inter-DC WANs, we find that using the 15 shortest paths between each pair of DCs results in negligible loss of throughput.

Allocation LP: Figure 6 shows the LP used in SWAN. At the core is the MCF (multi-commodity flow) function that maximizes the overall throughput while preferring shorter paths; ε is a small constant and tunnel weights wj are proportional to latency. sPri is the fraction of scratch link capacity that enables congestion-managed network updates;

Inputs:
  d_i: flow demand for source-destination pair i
  w_j: weight of tunnel j (e.g., latency)
  c_l: capacity of link l
  s_Pri: scratch capacity ([0, 50%]) for class Pri
  I_{j,l}: 1 if tunnel j uses link l and 0 otherwise

Outputs:
  b_i = Σ_j b_{i,j}: b_i is the allocation to flow i; b_{i,j} is its allocation over tunnel j

Func: SWAN Allocation:
  ∀ links l: c_l^remain ← c_l;   // remaining link capacity
  for Pri = Interactive, Elastic, ..., Background do
    {b_i} ← {Throughput Maximization | Approx. Max-Min Fairness}(Pri, {c_l^remain});
    c_l^remain ← c_l^remain − Σ_{i,j} b_{i,j} · I_{j,l};

Func: Throughput Maximization(Pri, {c_l^remain}):
  return MCF(Pri, {c_l^remain}, 0, ∞, ∅);

Func: Approx. Max-Min Fairness(Pri, {c_l^remain}):
  // Parameters α and U trade off unfairness for runtime
  // α > 1 and 0 < U ≤ min(fairrate_i)
  T ← ⌈log_α(max(d_i)/U)⌉;  F ← ∅;
  for k = 1 ... T do
    foreach b_i ∈ MCF(Pri, {c_l^remain}, α^(k−1)·U, α^k·U, F) do
      if i ∉ F and b_i < min(d_i, α^k·U) then
        F ← F + {i};  f_i ← b_i;   // flow saturated
  return {f_i : i ∈ F};

Func: MCF(Pri, {c_l^remain}, bLow, bHigh, F):
  // Allocate rate b_i for flows in priority class Pri
  maximize    Σ_i b_i − ε · (Σ_{i,j} w_j · b_{i,j})
  subject to  ∀i ∉ F: bLow ≤ b_i ≤ min(d_i, bHigh);
              ∀i ∈ F: b_i = f_i;
              ∀l: Σ_{i,j} b_{i,j} · I_{j,l} ≤ min{c_l^remain, (1 − s_Pri)·c_l};
              ∀(i,j): b_{i,j} ≥ 0.

Figure 6: Computing allocations over a set of tunnels.

it can be different for different priority classes (§4.3). The SWAN Allocation function allocates rate by invoking MCF separately for classes in priority order. After a class is allocated, its allocation is removed from the remaining link capacity.

It is easy to see that our allocation respects traffic priorities. By allocating demands in priority order, SWAN also ensures that higher priority traffic is likelier to use shorter paths. This keeps the computation simple because MCF's time complexity increases manifold with the number of constraints. While, in general, it may reduce overall utilization, in practice, SWAN achieves nearly optimal utilization (§6).

Max-min fairness can be achieved iteratively: maximize the minimal flow rate allocation, freeze the minimal flows and repeat with just the other flows [26]. However, solving such a long sequence of LPs is rather costly in practice, so we devised an approximate solution instead. SWAN provides approximated max-min fairness for services in the same class by invoking MCF in T steps, with the constraint that at step k, flows are allocated rates in the range [α^(k−1)U, α^k U], but no more than their demand. See Fig. 6, function Approx. Max-Min Fairness. A flow's allocation is frozen at step k when it is allocated its full demand di at that step or it receives a rate smaller than α^k U due to capacity constraints. If ri and bi are the max-min fair rate and the rate allocated to flow i, we can prove that this is an α-approximation algorithm, i.e., bi ∈ [ri/α, α·ri] (Theorem 1 in [14][3]).
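The loop structure of the approximation can be sketched as follows, with the MCF LP replaced by a trivial stub (no link constraints) so the sketch stays self-contained; a real implementation would solve the LP of Figure 6 at each step.

import math

def mcf_stub(demands, frozen, b_low, b_high):
    """Stand-in for the MCF LP of Figure 6: no link constraints, each unfrozen
    flow simply receives min(demand, b_high). A real implementation would solve
    the LP and also respect b_low and link capacities."""
    alloc = dict(frozen)
    for i, d in demands.items():
        if i not in frozen:
            alloc[i] = min(d, b_high)
    return alloc

def approx_max_min(demands, alpha=2.0, U=10e6):
    """Approximate max-min fairness per the loop in Figure 6: step k confines
    unfrozen flows to [alpha^(k-1) U, alpha^k U]; a flow is frozen once it
    reaches its demand or is held below alpha^k U by capacity."""
    T = math.ceil(math.log(max(demands.values()) / U, alpha))
    frozen = {}
    for k in range(1, T + 1):
        alloc = mcf_stub(demands, frozen, alpha ** (k - 1) * U, alpha ** k * U)
        for i, b in alloc.items():
            if i not in frozen and (b >= demands[i] or b < alpha ** k * U):
                frozen[i] = min(b, demands[i])  # flow saturated
    return frozen

demands = {"A->B": 8e9, "A->C": 120e6, "B->C": 30e6}  # bits/s, illustrative
print(approx_max_min(demands))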

[3] All proofs are included in a separate technical report [14].


Many proposals exist to combine network-wide max-min fairness with high throughput. A recent one offers a search function that is shown to empirically reduce the number of LPs that need to be solved [9]. Our contribution is showing that one can trade off the number of LP calls and the degree of unfairness. The number of LPs we solve per priority is T; with max di=10 Gbps, U=10 Mbps and α=2, we get T=10. We find that SWAN's allocations are highly fair and take less than a second combined for all priorities (§6). In contrast, Danna et al. report running times of over a minute [9].

Finally, our approach can be easily extended to other policy goals such as virtually dedicating capacity to a flow over certain paths and weighted max-min fairness.

Interactive service demands: SWAN estimates an interactive service's demand based on its average usage in the last five minutes. To account for prediction errors, we inflate the demand based on the error in past estimates (mean plus two standard deviations). This ensures that enough capacity is set aside for interactive traffic. So that inflated estimates do not let capacity go unused, when allocating rates to background traffic, SWAN adjusts available link capacities as if there was no inflation. If resource contention does occur, priority queueing at switches protects interactive traffic.
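A minimal sketch of this inflation rule (the window and variable names are illustrative):

import statistics

def inflated_estimate(recent_usage, past_errors):
    """Predict interactive demand as the recent average, inflated by the mean
    plus two standard deviations of past prediction errors."""
    base = statistics.mean(recent_usage)
    if len(past_errors) >= 2:
        pad = statistics.mean(past_errors) + 2 * statistics.stdev(past_errors)
    else:
        pad = 0.0
    return base + max(0.0, pad)

# Usage samples (Gbps) from the last five minutes and recent prediction errors.
print(inflated_estimate([3.1, 2.8, 3.4, 3.0], [0.2, -0.1, 0.3, 0.1]))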

Post-processing: The solution produced by the LP may not be feasible to implement; while it obeys link capacity constraints, it disregards rule count limits on switches. Directly including these limits in the LP would turn it into an integer LP, making it intractably complex. Hence, SWAN post-processes the output of the LP to fit into the number of rules available.

Finding the set of tunnels of a given size that carries the most traffic is NP-complete [13]. SWAN uses the following heuristic: first pick at least the smallest-latency tunnel for each DC pair, then prefer tunnels that carry more traffic (as per the LP's solution), and repeat as long as more tunnels can be added without violating the rule count constraint mj at switch j. If Mj is the number of tunnels that switch j can store and δ ∈ [0, 50%] is the scratch space needed for rule updates (§4.3.2), mj=(1−δ)Mj. In practice, we found that {mj} is large enough to ensure at least two tunnels per DC pair (§6.5). However, the original allocation of the LP is no longer valid since only a subset of the tunnels is selected due to rule limit constraints. We thus re-run the LP with only the chosen tunnels as input. The output of this run has both high utilization and is implementable in the network.
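A sketch of this greedy selection; the tunnel and budget data structures are assumptions, and per-switch budgets stand in for mj=(1−δ)Mj:

def select_tunnels(tunnels, rule_budget):
    """tunnels: list of dicts with keys 'pair', 'latency', 'lp_traffic', and
    'switches' (switches needing a rule for this tunnel).
    rule_budget: {switch: rules available}."""
    used = {}
    chosen = []

    def fits(t):
        return all(used.get(s, 0) < rule_budget.get(s, 0) for s in t["switches"])

    def take(t):
        for s in t["switches"]:
            used[s] = used.get(s, 0) + 1
        chosen.append(t)

    # First, the lowest-latency tunnel for every DC pair.
    by_pair = {}
    for t in tunnels:
        if t["pair"] not in by_pair or t["latency"] < by_pair[t["pair"]]["latency"]:
            by_pair[t["pair"]] = t
    for t in by_pair.values():
        if fits(t):
            take(t)

    # Then prefer tunnels carrying more traffic in the LP solution.
    for t in sorted(tunnels, key=lambda t: -t["lp_traffic"]):
        if t not in chosen and fits(t):
            take(t)
    return chosen

tunnels = [
    {"pair": ("DC1", "DC2"), "latency": 10, "lp_traffic": 5.0, "switches": ["s1", "s2"]},
    {"pair": ("DC1", "DC2"), "latency": 14, "lp_traffic": 2.0, "switches": ["s1", "s3", "s2"]},
]
print(len(select_tunnels(tunnels, {"s1": 2, "s2": 2, "s3": 1})))  # -> 2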

To further speed up allocation computation for large WANs, SWAN uses two strategies. First, it runs the LP at the granularity of DCs instead of switches. DCs have at least 2 WAN switches, so a DC-level LP has at least 4x fewer variables and constraints (and the complexity of an LP is at least quadratic in this number). To map DC-level allocations to switches, we leverage the symmetry of inter-DC WANs. Each WAN switch in a DC gets equal traffic from inside the DC because border routers use ECMP for outgoing traffic. Similarly, equal traffic arrives from neighboring DCs because switches in a DC have similar fan-out patterns to neighboring DCs. This symmetry allows traffic on each DC-level link (computed by the LP) to be spread equally among the switch-level links between two DCs. However, symmetry may be lost during failures; we describe how SWAN handles failures in §4.4.

Second, during allocation computation, SWAN aggregates the demands from all services in the same priority class between a pair of DCs. This reduces the number of flows that

Inputs:
  q: sequence length
  b^0_{i,j} = b_{i,j}: initial configuration
  b^q_{i,j} = b'_{i,j}: final configuration
  c_l: capacity of link l
  I_{j,l}: indicates if tunnel j uses link l

Outputs: {b^a_{i,j}} ∀a ∈ {1, ..., q} if feasible

maximize    c_margin   // remaining capacity margin
subject to  ∀i, a: Σ_j b^a_{i,j} = b_i;
            ∀l, a: c_l ≥ Σ_{i,j} max(b^a_{i,j}, b^{a+1}_{i,j}) · I_{j,l} + c_margin;
            ∀(i, j, a): b^a_{i,j} ≥ 0;  c_margin ≥ 0;

Figure 7: LP to find if a congestion-free update sequence of length q exists.

the LP has to allocate by a factor that equals the number of services, which can run into 100s. Given the per DC-pair allocation, we divide it among individual services in a max-min fair manner.
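Dividing a DC-pair allocation among its services in a max-min fair manner is standard water-filling; a short sketch:

def max_min_share(total, demands):
    """Water-filling: split `total` among services so that no service gets more
    than its demand and unmet services share the remainder equally."""
    alloc = {s: 0.0 for s in demands}
    active = set(demands)
    remaining = total
    while active and remaining > 1e-9:
        share = remaining / len(active)
        satisfied = {s for s in active if demands[s] - alloc[s] <= share}
        if not satisfied:
            for s in active:
                alloc[s] += share   # all remaining services are bottlenecked equally
            break
        for s in satisfied:
            remaining -= demands[s] - alloc[s]
            alloc[s] = demands[s]
        active -= satisfied
    return alloc

# 10 Gbps between a DC pair, three same-priority services (hypothetical demands).
print(max_min_share(10.0, {"svc1": 2.0, "svc2": 9.0, "svc3": 6.0}))
# -> svc1: 2.0, svc2: 4.0, svc3: 4.0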

4.3 Updating forwarding state

To keep the network highly utilized, its forwarding state must be updated as traffic demand and network topology change. Our goal is to enable forwarding state updates that are not only congestion-free but also quick; the more agile the updates, the better one can utilize the network. One can meet these goals trivially by simply pausing all data movement on the network during a configuration change. Hence, an added goal is that the network continue to carry significant traffic during updates.[4]

Forwarding state updates are of two types: changing the distribution of traffic across available tunnels and changing the set of tunnels available in the network. We describe below how we make each type of change.

4.3.1 Updating traffic distribution across tunnels

Given two congestion-free configurations with different traffic distributions, we want to update the network from the first configuration to the second in a congestion-free manner. More precisely, let the current network configuration be C = {b_{ij} : ∀(i,j)}, where b_{ij} is the traffic of flow i over tunnel j. We want to update the network's configuration to C' = {b'_{ij} : ∀(i,j)}. This update can involve moving many flows, and when an update is applied, the individual switches may apply the changes in any order. Hence, many transient configurations emerge, and in some, a link's load may be much higher than its capacity. We want to find a sequence of configurations (C = C_0, ..., C_k = C') such that no link is overloaded in any configuration. Further, no link should be overloaded when moving from C_i to C_{i+1}, regardless of the order in which individual switches apply their updates.

In arbitrary cases, congestion-free update sequences do not exist; when all links are full, any first move will congest at least one link. However, given the scratch capacity that we engineered on each link (sPri; §4.2), we show that there exists a congestion-free sequence of updates of length no more than ⌈1/s⌉−1 steps (Theorem 2 in [14]). The constructive proof of this theorem yields an update sequence with exactly ⌈1/s⌉−1 steps. But shorter sequences may exist and are desirable because they will lead to faster updates.

We use an LP-based algorithm to find the sequence with the minimal number of steps. Figure 7 shows how to

[4] Network updates can cause packet re-ordering. In this work, we assume that switch-level (e.g., FLARE [18]) or host-level mechanisms (e.g., reordering-robust TCP [36]) are in place to ensure that applications are not hurt.


examine whether a feasible sequence of q steps exists. We vary q from 1 to ⌈1/s⌉−1 in increments of 1. The key part in the LP is the constraint that limits the worst-case load on a link during an update to be below link capacity. This load is Σ_{i,j} max(b^a_{i,j}, b^{a+1}_{i,j}) · I_{j,l} at step a; it occurs when none of the flows that will decrease their contribution have done so, but all flows that will increase their contribution have already done so. If q is feasible, the LP outputs C_a = {b^a_{i,j}}, for a = (1, ..., q−1), which represent the intermediate configurations that form a congestion-free update sequence.

From congestion-free to bounded-congestion: We showed above that leaving scratch capacity on each link facilitates congestion-free updates. If there exists a class of traffic that is tolerant to moderate congestion (e.g., background traffic), then scratch capacity need not be left idle; we can fully use link capacities with the caveat that transient congestion will only be experienced by traffic in this class. To realize this, when computing flow allocations (§4.2), we use sPri = s > 0 for interactive and elastic traffic, but set sPri = 0 for background traffic (which is allocated last). Thus, link capacity can be fully used, but no more than a (1−s) fraction is used by non-background traffic. Just this, however, is not enough: since links are no longer guaranteed to have slack, there may not be a congestion-free solution within ⌈1/s⌉−1 steps. To remedy this, we replace the per-link capacity constraint in Figure 7 with two constraints, one to ensure that the worst-case traffic on a link from all classes is no more than (1+η) of link capacity (η ∈ [0, 50%]) and another to ensure that the worst-case traffic due to the non-background traffic is below link capacity. In this case, we prove that i) there is a feasible solution within max(⌈1/s⌉−1, ⌈1/η⌉) steps (Theorem 3 in [14]) such that ii) the non-background traffic never encounters loss and iii) the background traffic experiences no more than an η fraction loss. Based on this result, we set η = s/(1−s) in SWAN, which ensures the same ⌈1/s⌉−1 bound on steps as before.

4.3.2 Updating tunnels

To update the set of tunnels in the network from P to P', SWAN first computes a sequence of tunnel-sets (P = P_0, ..., P_k = P') that each fit within the rule limits of switches. Second, for each set, it computes how much traffic from each service can be carried (§4.2). Third, it signals services to send at a rate that is the minimum across all tunnel-sets. Fourth, after Th=10 seconds, when services have changed their sending rate, it starts executing tunnel changes as follows. To go from set P_i to P_{i+1}: i) add tunnels that are in P_{i+1} but not in P_i (the computation of tunnel-sets, described below, guarantees that this will not violate rule count limits); ii) change the traffic distribution, using bounded-congestion updates, to what is supported by P_{i+1}, which frees up the tunnels that are in P_i but not in P_{i+1}; iii) delete these tunnels. Finally, SWAN signals to services to start sending at the rate that corresponds to P'.

We compute the interim tunnel-sets as follows. Let P_i^add and P_i^rem be the sets of tunnels that remain to be added and removed, respectively, at step i. Initially, P_0^add = P' − P and P_0^rem = P − P'. At each step i, we first pick a subset p_i^a ⊆ P_i^add to add and a subset p_i^r ⊆ P_i^rem to remove. We then update the tunnel sets as: P_{i+1} = (P_i ∪ p_i^a) − p_i^r, P_{i+1}^add = P_i^add − p_i^a, and P_{i+1}^rem = P_i^rem − p_i^r. The process ends when P_i^add and P_i^rem are empty (at which point P_i will be P').

At each step, we also maintain the invariant that P_{i+1}, which is the next set of tunnels that will be installed in the network, leaves δMj rule space free at every switch j. We achieve this by picking the maximal set p_i^a such that the tunnels in p_0^a ∪ ··· ∪ p_i^a fit within t_i^add rules and the minimal set p_i^r such that the tunnels that remain to be removed (P_i^rem − p_i^r) fit within t_i^rem rules. The value of t_i^add increases with i and that of t_i^rem decreases with i; they are defined more precisely in Theorem 4 in [14]. Within the size constraint, when selecting p_i^a, SWAN prefers tunnels that will carry more traffic in the final configuration (P') and those that transit through fewer switches. When selecting p_i^r, it prefers tunnels that carry less traffic in P_i and those that transit through more switches. This biases SWAN towards finding interim tunnel-sets that carry more traffic and use fewer rules.
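A simplified sketch of this staged swap: each step adds as many pending tunnels as a scratch-rule budget allows, (conceptually) shifts traffic, and then deletes tunnels that are no longer needed. A single global budget stands in for the per-switch t_i^add/t_i^rem schedule of Theorem 4 in [14], so this illustrates the shape of the algorithm rather than the algorithm itself.

def tunnel_set_sequence(P, P_final, scratch_rules_per_step):
    """Return interim tunnel-sets (P = P_0, ..., P_k = P_final), adding at most
    `scratch_rules_per_step` new tunnels per step before removals.
    Simplified: one global rule budget instead of per-switch budgets."""
    current = set(P)
    to_add = list(set(P_final) - set(P))
    to_remove = list(set(P) - set(P_final))
    sequence = [set(current)]
    while to_add or to_remove:
        batch_add = to_add[:scratch_rules_per_step]
        to_add = to_add[scratch_rules_per_step:]
        current |= set(batch_add)
        # Traffic would be shifted here (bounded-congestion update), after which
        # the tunnels scheduled for removal no longer carry traffic.
        batch_rem = to_remove[:scratch_rules_per_step]
        to_remove = to_remove[scratch_rules_per_step:]
        current -= set(batch_rem)
        sequence.append(set(current))
    return sequence

P = {"t1", "t2", "t3", "t4"}
P_new = {"t1", "t5", "t6", "t7"}
for step, tset in enumerate(tunnel_set_sequence(P, P_new, scratch_rules_per_step=2)):
    print(step, sorted(tset))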

We show that the algorithm above requires at most ⌈1/δ⌉−1 steps and satisfies the rule count constraints (Theorem 4 in [14]). At interim steps, some services may get an allocation that is lower than that in P or P'. The problem of finding interim tunnel-sets in which no service's allocation is lower than in the initial and final sets, given link capacity constraints, is NP-hard. (Even much simpler problems related to rule limits are NP-hard [13].) In practice, however, services rarely experience short-term reductions (§6.6). Also, since both P and P' contain a common core in which there is at least one common tunnel between each DC pair (per our tunnel selection algorithm; §4.2), basic connectivity is always maintained during transitions, which in practice suffices to carry at least all of the interactive traffic.

4.4 Handling failures

Gracefully handling failures is an important part of a global resource controller. We outline how SWAN handles failures. Link and switch failures are detected and communicated to the controller by network agents, in response to which the controller immediately computes new allocations. Some failures can break the symmetry in topology that SWAN leverages for scalable computation of allocation. When computing allocations over an asymmetric topology, the controller expands the topology of impacted DCs and computes allocations at the switch level directly.

Network agents, service brokers, and the controller have backup instances that take over when the primary fails. For simplicity, the backups do not maintain state but acquire what is needed upon taking over. Network agents query the switches for topology, traffic, and current rules. Service brokers wait for Th (10 seconds), by which time all hosts would have contacted them. The controller queries the network agents for topology, traffic, and the current rule set, and service brokers for current demand. Further, hosts stop sending traffic when they are unable to contact the (primary and secondary) service broker. Service brokers retain their current allocation when they cannot contact the controller. In the period between the primary controller failing and the backup taking over, the network continues to forward traffic as last configured.

4.5 Prototype implementation

We have developed a SWAN prototype that implements all the elements described above. The controller, service brokers and hosts, and network agents communicate with each other using RESTful APIs. We implemented network agents using the Floodlight OpenFlow controller [11], which


Figure 8: Our testbed. (a) Partial view of the equipment. (b) Emulated DC-level topology (Hong Kong, New York, Florida, Barcelona, Los Angeles). (c) Closer look at physical connectivity for a pair of DCs.

allows SWAN to work with commodity OpenFlow switches. We use the QoS features in Windows Server 2012 to mark DSCP bits in outgoing packets and rate limit traffic using token buckets. We configure priority queues per class in switches. Based on our experiments (§6), we set s=10% and δ=10% in our prototype.

5. TESTBED-BASED EVALUATION

We evaluate SWAN on a modest-sized testbed. We examine the efficiency and the value of congestion-controlled updates using today's OpenFlow switches and under TCP dynamics. The results of several other testbed experiments, such as failure recovery time, are in [14]. We will extend our evaluation to the scale of today's inter-DC WANs in §6.

5.1 Testbed and workload

Our testbed emulates an inter-DC WAN with 5 DCs spread across three continents (Figure 8). Each DC has: i) two WAN-facing switches; ii) 5 servers per DC, where each server has a 1G Ethernet NIC and acts as 25 virtual hosts; and iii) an internal router that splits traffic from the hosts over the WAN switches. A logical link between DCs is two physical links between their WAN switches. WAN switches are a mix of Arista 7050Ts and IBM Blade G8264s, and routers are a mix of Cisco N3Ks and Juniper MX960s. The SWAN controller is in New York, and we emulate control message delays based on geographic distances.

In our experiment, every DC pair has a demand in each priority class. The demand of the Background class is infinite, whereas Interactive and Elastic demands vary with a period of 3 minutes as per the patterns shown in Figure 9. Each DC pair has a different phase, i.e., their demands are not synchronized. We picked these demands because they have sudden changes in quantity and spatial characteristics to stress SWAN. The actual traffic per {DC-pair, class} consists of 100s of TCP flows. Our switches do not support unequal splitting, so we insert appropriate rules into the switches to split traffic as needed based on IP headers.

We set Ts and Tc, the service demand and network update frequencies, to one minute, instead of five, to stress-test SWAN's dynamic behavior.

5.2 Experimental results

Efficiency: Figure 10 shows that SWAN closely approximates the throughput of an optimal method. For each 1-min interval, this method computes service rates using a multi-class, multi-commodity flow problem that is not constrained

Figure 9: Demand patterns for testbed experiments. (a) Interactive. (b) Elastic.

Figure 10: SWAN achieves near-optimal throughput.

Figure 11: Updates in SWAN do not cause congestion. (a) SWAN. (b) One-shot. (c) One-shot.

by the set of available tunnels or rule count limits. Its prediction of interactive traffic is perfect, it has no overhead due to network updates, and it can modify service rates instantaneously.

Overall, we see that SWAN closely approximates the optimal method. The dips in traffic occur during updates because we ask services whose new allocations are lower to reduce their rates, wait for Th=10 seconds, and then ask services with higher allocations to increase their rates. The impact of these dips is low in practice when there are more flows and the update frequency is 5 minutes (§6.6).

Congestion-controlled updates: Figure 11a zooms in on an example update. A new epoch starts at zero and the throughput of each class is shown relative to its maximal allocation before and after the update. We see that with SWAN there is no adverse impact on the throughput of any class when the forwarding plane update is executed at t=10s.

To contrast, Figure 11b shows what happens without congestion-controlled updates. Here, as in SWAN, 10% of scratch capacity is kept with respect to non-background traffic, but all update commands are issued to switches in one step. We see that the Elastic and Background classes suffer transient throughput degradation due to congestion-induced losses followed by TCP backoffs. Interactive traffic is protected due to priority queuing in this example, but that does not hold for updates that move a lot of interactive traffic across paths. During updates, the throughput degradation across all traffic in a class is 20%, but as Figure 11c shows, it is as high as 40% for some of the flows.

6. DATA-DRIVEN EVALUATION

To evaluate SWAN at scale, we conduct data-driven simulations with topologies and traffic from two production inter-DC WANs of large cloud service providers (§6.1). We show that SWAN can carry 60% more traffic than MPLS TE (§6.2) and is fairer than MPLS TE (§6.3). We also show that SWAN enables congestion-controlled updates (§6.4) using bounded switch state (§6.5).

6.1 Datasets and methodology

We consider two inter-DC WANs:

IDN: A large, well-connected inter-DC WAN with more than 40 DCs. We have accurate topology, capacity, and traffic information for this network. Each DC is connected to 2-16 other DCs, and inter-DC capacities range from tens of Gbps to Tbps. Major DCs have more neighbors and higher capacity connectivity. Each DC has two WAN routers for fault tolerance, and each router connects to both routers in the neighboring DC. We obtain flow-level traffic on this network using sFlow logs collected by routers.

G-Scale: Google's inter-DC WAN with 12 DCs and 19 inter-DC links [15]. We do not have traffic and capacity information for it. We simulate traffic on this network using logs from another production inter-DC WAN (different from IDN) with a similar number of DCs. In particular, we randomly map nodes from this other network to G-Scale. This mapping retains the burstiness and skew of inter-DC traffic, but not any spatial relationships between the nodes.

We estimate capacity based on the gravity model [30]. Reflecting common provisioning practices, we also round capacity up to the nearest multiple of 80 Gbps. We obtained qualitatively similar results (omitted from the paper) with three other capacity assignment methods: i) capacity is based on 5-minute peak usage across a week when the traffic is carried over shortest paths using ECMP (we cannot use MPLS TE as that requires capacity information); ii) capacity between each pair of DCs is 320 Gbps; iii) capacity between a pair of DCs is 320 or 160 Gbps with equal probability.
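For concreteness, the following sketch shows one plausible reading of gravity-model capacity assignment with the 80 Gbps rounding; the exact normalization and scaling used in the paper are not specified here, so the formula and the peak_total_gbps parameter are assumptions.

```python
# A minimal sketch of gravity-model capacity assignment: the capacity of link
# (i, j) is taken proportional to the product of the total traffic touching DC i
# and DC j, then rounded up to a multiple of 80 Gbps. This is an illustrative
# reading of the gravity model, not the paper's exact procedure.
import math

def gravity_capacities(demand, links, peak_total_gbps):
    """demand[i][j]: offered traffic from i to j (Gbps); links: iterable of (i, j) pairs."""
    out = {i: sum(row.values()) for i, row in demand.items()}
    into = {j: sum(demand[i].get(j, 0) for i in demand) for j in demand}
    total = sum(out.values()) or 1.0
    caps = {}
    for (i, j) in links:
        raw = peak_total_gbps * (out[i] + into[i]) * (out[j] + into[j]) / (2 * total) ** 2
        caps[(i, j)] = 80 * math.ceil(raw / 80)   # round up to the nearest 80 Gbps
    return caps
```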

With the help of network operators, we classify traffic into individual services and map each service to the Interactive, Elastic, or Background class.

We conduct experiments using a flow-level simulator that implements a complete version of SWAN. The demand of the services is derived from the traffic information in a week-long network log. If the full demand of a service is not allocated in an interval, it carries over to the next interval. We place the SWAN controller at a central DC and simulate control plane latency between the controller and entities in other DCs (service brokers, network agents). This latency is based on shortest paths, where the latency of each hop is based on the speed of light in fiber and great circle distance.

6.2 Network utilization

To evaluate how well SWAN utilizes the network, we compare it to an optimal method that can offer 100% utilization. This method computes how much traffic can be carried in each 5-min interval by solving a multi-class, multi-commodity flow problem. It is restricted only by link capacities, not by rule count limits. Its changes to service rates are instantaneous, and its rate limiting and interactive traffic prediction are perfect.
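A much-simplified version of this baseline can be written as a linear program. The sketch below is our own illustration, not the paper's formulation: it maximizes total flow over a fixed set of paths for a single traffic class, whereas the actual baseline is multi-class and is not restricted to tunnels.

```python
# Simplified sketch of a throughput-maximizing LP baseline. Variables are the
# flow placed on each (commodity, path) pair; constraints are link capacities
# and per-commodity demands. Uses scipy's linprog for brevity.
import numpy as np
from scipy.optimize import linprog

def max_throughput(paths, demands, capacity):
    """paths[k]: list of paths (each a list of links) for commodity k;
       demands[k]: demand of commodity k; capacity[link]: capacity of that link."""
    cols = [(k, p) for k, plist in enumerate(paths) for p in plist]
    c = -np.ones(len(cols))                       # maximize total allocated flow
    A_ub, b_ub = [], []
    for link, cap in capacity.items():            # link capacity constraints
        A_ub.append([1.0 if link in p else 0.0 for _, p in cols])
        b_ub.append(cap)
    for k, d in enumerate(demands):               # do not exceed each commodity's demand
        A_ub.append([1.0 if kk == k else 0.0 for kk, _ in cols])
        b_ub.append(d)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=(0, None))
    return -res.fun if res.success else 0.0
```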

We also compare SWAN to the current practice, MPLS TE (§2). Our MPLS TE implementation has the advanced features that IDN uses [4, 24]. Priorities for packets and tunnels protect higher-priority packets and ensure shorter paths for higher-priority services. Per re-optimization, CSPF is invoked periodically (5 minutes) to search for better path assignments.

Figure 12: SWAN carries more traffic than MPLS TE. Throughput normalized to optimal: (a) IDN: SWAN 99.0%, SWAN w/o rate control 70.3%, MPLS TE 58.3%; (b) G-Scale: SWAN 98.1%, SWAN w/o rate control 75.5%, MPLS TE 65.4%.

Per auto-bandwidth, tunnel bandwidth is periodically (5 minutes) adjusted based on the current traffic demand, estimated by the maximum of the average (across 5-minute intervals) demand in the past 15 minutes.
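The auto-bandwidth estimate reduces to a one-line computation; the sketch below assumes the raw input is a list of traffic samples per 5-minute interval.

```python
# A minimal sketch of the auto-bandwidth estimate described above: the tunnel
# demand is the maximum, over the last three 5-minute intervals, of the average
# traffic observed in each interval (15 minutes of history in total).
def autobandwidth_estimate(samples_per_interval):
    """samples_per_interval: list of lists of traffic samples (e.g. bps),
       one inner list per 5-minute interval, oldest first."""
    recent = [s for s in samples_per_interval[-3:] if s]
    return max(sum(s) / len(s) for s in recent)
```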

Figure 12 shows the traffic that different methods can carry compared to the optimal. To quantify the traffic that a method can carry, we scale service demands by the same factor and use binary search to derive the maximum admissible traffic. We define admissibility as carrying at least 99.9% of service demands. Using a threshold less than 100% makes results robust to demand spikes.
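The admissibility search can be sketched as follows; carried_fraction stands in for a run of the method under test (SWAN, MPLS TE, or the optimal) on the scaled demands, and both it and the upper bound of 16 on the scale factor are hypothetical.

```python
# A sketch of the admissibility search: scale all service demands by a common
# factor and binary-search for the largest factor at which the method still
# carries at least 99.9% of the scaled demand.
def max_admissible_scale(demands, carried_fraction, lo=0.0, hi=16.0, iters=30):
    """carried_fraction(scaled_demands) -> fraction of demand actually carried."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        scaled = {k: v * mid for k, v in demands.items()}
        if carried_fraction(scaled) >= 0.999:
            lo = mid            # admissible: try a larger scale
        else:
            hi = mid            # not admissible: shrink the scale
    return lo
```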

We see that MPLS TE carries only around 60% of the optimal amount of traffic. SWAN, on the other hand, can carry 98% for both WANs. This difference means that SWAN carries over 60% more traffic than MPLS TE, which is a significant gain in the value extracted from the inter-DC WAN.

To decouple the gains of SWAN from its two main components (coordination across services and global network configuration), we also simulated a variant of SWAN where the former is absent. Here, instead of getting demand requests from services, we estimate demand from their throughput in a manner similar to MPLS TE. We also do not control the rate at which services send. Figure 12 shows that this variant of SWAN improves utilization by 10-12% over MPLS TE, i.e., it carries 15-20% more traffic. Even this level of increase in efficiency translates to savings of millions of dollars in the cost of carrying wide-area traffic. By studying a (hypothetical) version of MPLS TE that perfectly knows future traffic demand (instead of estimating it based on history), we find that most of SWAN's gain over MPLS TE stems from its ability to find better path assignments.

We draw two conclusions from this result. First, both components of SWAN are needed to fully achieve its gains. Second, even in networks where incoming traffic cannot be controlled (e.g., an ISP network), worthwhile utilization improvements can be obtained through the centralized resource allocation offered by SWAN.

In the rest of the paper, we present results only for IDN. The results for G-Scale are qualitatively similar and are deferred to [14] due to space constraints.

6.3 Fairness

SWAN improves not only efficiency but also fairness. To study fairness, we scale demands such that background traffic is 50% higher than what a mechanism admits; fairness is of interest only when traffic demands cannot be fully met. Scaling relative to the traffic admitted by a mechanism ensures that the oversubscription level is the same. If we used an identical demand for SWAN and MPLS TE, the oversubscription for MPLS TE would be higher as it carries less traffic.

For an exemplary 5-minute window, Figure 13a shows the throughput that individual flows get relative to their max-min fair share. We focus on background traffic as the higher priority for other traffic means that its demands are often met.


Figure 13: SWAN is fairer than MPLS TE. (a) Per-flow throughput relative to the max-min fair rate for an example 5-minute window; (b) complementary CDF, over flows and time, of the relative error between throughput and the max-min fair rate.

Figure 14: Number of stages and loss in network throughput as a function of scratch capacity.

We compute max-min fair shares using a precise but computationally complex method (which is unsuitable for online use) [26]. We see that SWAN approximates max-min fair sharing well. In contrast, the greedy, local allocation of MPLS TE is significantly unfair.

Figure 13b shows aggregated results. In SWAN, only 4% of the flows deviate by over 5% from their fair share. In MPLS TE, 20% of the flows deviate by that much, and the worst-case deviation is much higher. As Figure 13a shows, the flows that deviate are not necessarily high- or low-demand, but are spread across the board.
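The deviation statistic behind Figure 13b can be computed as in the sketch below, assuming per-flow throughputs and max-min fair rates are already available (the fair rates themselves come from the method of [26], which is not shown here).

```python
# A small sketch of the fairness metric behind Figure 13b: for each flow,
# compute the relative error between its achieved throughput and its max-min
# fair rate, and report the fraction of flows that deviate by more than 5%.
def fairness_deviation(throughput, fair_rate, threshold=0.05):
    errors = {f: abs(throughput[f] - fair_rate[f]) / fair_rate[f] for f in fair_rate}
    frac_deviating = sum(e > threshold for e in errors.values()) / len(errors)
    return errors, frac_deviating
```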

6.4 Congestion-controlled updates

We now study congestion-controlled updates: first the tradeoff regarding the amount of scratch capacity and then their benefit. Higher levels of scratch capacity lead to fewer stages, and thus faster transitions; but they lower the amount of non-background traffic that the network can carry and can waste capacity if background traffic demand is low. Figure 14 shows this tradeoff in practice. The left graph plots the maximum number of stages and the loss in network throughput as a function of scratch capacity. At the s=0% extreme, throughput loss is zero but more stages (infinitely many in the worst case) are needed to transition safely. At the s=50% extreme, only one stage is needed, but the network delivers 25-36% less traffic. The right graph shows the PDF of the number of stages for three values of s. Based on these results, we use s=10%, where the throughput loss is negligible and updates need only 1-3 steps (which is much lower than the theoretical worst case of 9).

To evaluate the benefit of congestion-controlled updates, we compare with a method that applies updates in one shot. This method is identical in every other way, including the amount of scratch capacity left on links. Both methods send updates in a step to the switches in parallel. Each switch applies its updates sequentially and takes 2 ms per update [8].

For each method, during each reconfiguration, we compute the maximum oversubscription (i.e., load relative to capacity) at each link. Short-lived oversubscription will be absorbed by switch queues.

Figure 15: Link oversubscription during updates (left: oversubscription ratio; right: overloaded traffic in MB; series: one-shot and SWAN, for background and non-background traffic).

Figure 16: SWAN needs fewer rules to fully exploit network capacity (left). The number of stages needed for rule changes is small (right).

Hence, we also compute the maximal buffering required at each link for it to not drop any packet, i.e., the total excess bytes that arrive during oversubscribed periods. If this number is higher than the size of the physical queue, packets will be dropped. Per priority queuing, we compute oversubscription separately for each traffic class; the computation for non-background traffic ignores background traffic, but that for background traffic considers all traffic.
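Both measures can be computed from a per-link load time series, as in the sketch below; the millisecond granularity and the drain-to-zero queue model are our assumptions.

```python
# A sketch of the two congestion measures described above, given a time series
# of per-link load (bytes per millisecond) during a reconfiguration: the peak
# oversubscription ratio and the excess bytes that queues must absorb.
def congestion_during_update(load_series_bytes_per_ms, capacity_bytes_per_ms):
    peak_ratio = max(load_series_bytes_per_ms) / capacity_bytes_per_ms
    excess, max_excess = 0.0, 0.0
    for load in load_series_bytes_per_ms:
        # Queue builds while load exceeds capacity and drains (down to 0) otherwise.
        excess = max(0.0, excess + load - capacity_bytes_per_ms)
        max_excess = max(max_excess, excess)
    return peak_ratio, max_excess   # compare max_excess to the switch queue size (9-16 MB)
```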

Figure 15 shows oversubscription ratios on the left. We see heavy oversubscription with one-shot updates, especially for background traffic. Links can be oversubscribed by up to 60% of their capacity. The right graph plots extra bytes on the links. Today's top-of-line switches, which we use in our testbed, have queue sizes of 9-16 MB. But we see that oversubscription can bring 100s of MB of excess packets and hence most of these will be dropped. Note that we did not model TCP backoffs, which would reduce the load on a link after packet loss starts happening, but regardless, those flows would see significant slowdown. With SWAN, the worst-case oversubscription is only 11% (= s/(1-s) with s=10%, i.e., 0.1/0.9 ≈ 0.11), as configured for bounded-congestion updates, which presents a significantly better experience for background traffic.

We also see that despite 10% slack, one-shot updates fail to protect even the non-background traffic, which is sensitive to loss and delay. Oversubscription can be up to 20%, which can bring over 50 MB of extra bytes during reconfigurations. SWAN fully protects non-background traffic and hence that curve is omitted.

Since routes are updated very frequently, even a small likelihood of severe packet loss due to updates can lead to frequent user-visible network incidents. For example, when updates happen every minute, a 1/1000 likelihood of severe packet loss due to route updates leads to an interruption, on average, once every 7 minutes on the IDN network.

6.5 Rule management

We now study rule management in SWAN. A primary measure of interest here is the amount of network capacity that can be used given a switch rule count limit. Figure 16 (left) shows this measure for SWAN and an alternative that installs rules for the k-shortest paths between DC pairs; k is chosen such that the rule count limit is not violated at any switch. We see that k-shortest path routing requires 20K rules to fully use network capacity. As mentioned before, this requirement is beyond what will be offered by next-generation switches.


Figure 17: Time for network update.

Figure 18: (a) SWAN carries close to optimal traffic even during updates. (b) Frequent updates lead to higher throughput.

The natural progression towards faster link speeds and larger WANs means that future switches may need even more rules. If switches support 1K rules, k-shortest path routing is unable to use 10% of the network capacity. In contrast, SWAN's dynamic tunnels approach enables it to fully use network capacity with an order of magnitude fewer rules. This fits within the capabilities of current-generation switches.
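To illustrate how the rule requirement of k-shortest-path routing can be tallied, the sketch below counts one rule per tunnel at every switch the tunnel traverses; it uses networkx only for path enumeration and is not the evaluation code.

```python
# A sketch of rule counting for k-shortest-path routing, assuming one
# forwarding rule per tunnel at each switch on the tunnel's path.
from itertools import islice
import networkx as nx

def rules_per_switch(G, dc_pairs, k):
    count = {n: 0 for n in G.nodes}
    for (src, dst) in dc_pairs:
        for path in islice(nx.shortest_simple_paths(G, src, dst), k):
            for node in path:
                count[node] += 1
    return count   # compare max(count.values()) against the switch rule limit
```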

Figure 16 (right) shows the number of stages needed to dynamically change tunnels. It assumes a limit of 750 OpenFlow rules, which is what our testbed switches support. With 10% slack, only two stages are needed 95% of the time. This nimbleness stems from 1) the efficiency of dynamic tunnels: a small set of rules is needed per interval, and 2) temporal locality in demand matrices: this set changes slowly across adjacent intervals.

6.6 Other microbenchmarks

We close our evaluation of SWAN by reporting on some key microbenchmarks.

Update time: Figure 17 shows the time to update IDN from the start of a new epoch. Our controller uses a PC with a 2.8 GHz CPU and runs unoptimized code. The left graph shows a CDF across all updates. The right graph depicts a timeline of the average time spent in various parts. Most updates finish in 22 s; most of this time goes into waiting for service rate limits to take effect, 10 s each to wait for services to reduce their rate (t1 to t3) and then for those whose rate increases (t4 to t5). SWAN computes the congestion-controlled plan in parallel with the first of these. The network's data plane is in flux for only 600 ms on average (t3 to t4). This includes communication delay from the controller to switches and the time to update rules at switches, multiplied by the number of stages required to bound congestion. If SWAN were used in a network without explicit resource signaling, the average update time would only be this 600 ms.

Traffic carried during updates: During updates, SWAN ensures that the network continues to maintain high utilization. That the overall network utilization of SWAN comes close to optimal (§6.2) is evidence of this behavior. More directly, Figure 18a shows the percentage of traffic that SWAN carries during updates compared to an optimal method with instantaneous updates. The median value is 96%.

Update frequency: Figure 18b shows that frequent updates to the network's data plane lead to higher efficiency. It plots the drop in throughput as the update duration is increased. The service demands still change every 5 minutes, but the network data plane updates at the slower rate (x-axis) and the controller allocates as much traffic as the current data plane can carry. We see that an update frequency of 10 (100) minutes reduces throughput by 5% (30%).

Prediction error for interactive traffic: The error in predicting interactive traffic demand is small. For only 1% of the time, the actual amount of interactive traffic on a link differs from the predicted amount by over 1%.

7. DISCUSSION

This section discusses several issues that, for conciseness, were not mentioned in the main body of the paper.

Non-conforming traffic: Sometimes services may (e.g., due to bugs) send more than what is allocated. SWAN can detect these situations using traffic logs that are collected from switches every 5 minutes. It can then notify the owners of the service and protect other traffic by re-marking the DSCP bits of non-conforming traffic to a class that is even lower than background traffic, so that it is carried only if there is spare capacity.

Truthful declaration: Services may declare their lower-priority traffic as higher priority or ask for more bandwidth than they can consume. SWAN discourages this behavior through appropriate pricing: services pay more for higher priority and pay for all allocated resources. (Even within a single organization, services pay for the infrastructure resources they consume.)

Richer service-network interface: Our current design has a simple interface between the services and the network, based on current bandwidth demand. In future work, we will consider a richer interface, such as letting services reserve resources ahead of time and letting them express their needs in terms of total bytes and a deadline by which they must be transmitted. Better knowledge of such needs can further boost efficiency, for instance, by enabling store-and-forward transfers through intermediate DCs [21]. The key challenge here is the design of scalable and fair allocation mechanisms that can accommodate the diversity of service needs.

8. RELATED WORK

SWAN builds upon several themes in prior work.

Intra-DC traffic management: Many recent works manage intra-DC traffic to better balance load [1, 7, 8] or to share the network among selfish parties [16, 28, 31]. SWAN is similar to the former in using centralized TE and to the latter in providing fairness. But the intra-DC case has constraints and opportunities that do not translate to the WAN. For example, EyeQ [16] assumes that the network has a full bisection bandwidth core and hence only paths to or from the core can be congested; this need not hold for a WAN. Seawall [31] uses TCP-like adaptation to converge to fair shares, but high RTTs on the WAN would mean slow convergence. FairCloud [28] identifies strategy-proof sharing mechanisms, i.e., those resilient to the choices of individual actors. SWAN uses explicit resource signaling to disallow such greedy actions.


Signaling also helps it avoid estimating demands, which other centralized TE schemes have to do [1, 8].

WAN TE & SDN: As in SWAN, B4 uses SDNs in the context of inter-DC WANs [15]. Although this parallel work shares a similar high-level architecture, it addresses different challenges. While B4 develops custom switches and mechanisms to integrate existing routing protocols in an SDN environment, SWAN develops mechanisms for congestion-free data plane updates and for effectively using the limited forwarding table capacity of commodity switches.

Optimizing WAN efficiency has a rich literature, including tuning ECMP weights [12], adapting allocations across pre-established tunnels [10, 17], storing and re-routing bulk data at relay nodes [21], caching at the application layer [32], and leveraging reconfigurable optical networks [22]. While such bandwidth efficiency is one of its design goals, SWAN also addresses the performance and bandwidth requirements of different traffic classes. In fact, SWAN can help many of these systems by providing available bandwidth information and by offering routes through the WAN that may not be discovered by application-layer overlays.

Guarantees during network update: Some recent work provides guarantees during network updates: on connectivity, on loop-free paths, or that a packet will see a consistent set of SDN rules [19, 23, 29, 34]. SWAN offers a stronger guarantee: the network remains uncongested during forwarding rule changes. Vanbever et al. [34] suggest finding an ordering of updates to individual switches that is guaranteed to be congestion free; however, we find that such an ordering may not exist (§6.4) and is unlikely to exist when the network operates at high utilization.

9. CONCLUSIONS

SWAN enables a highly efficient and flexible inter-DC WAN by coordinating the sending rates of services and centrally configuring the network data plane. Frequent network updates are needed for high efficiency, and we showed how, by leaving a small amount of scratch capacity on links and in switch rule memory, these updates can be implemented quickly and without congestion or disruption. Testbed experiments and data-driven simulations show that SWAN can carry 60% more traffic than the current practice.

Acknowledgements. We thank Rich Groves, Parantap Lahiri, Dave Maltz, and Lihua Yuan for feedback on the design of SWAN. We also thank Matthew Caesar, Brighten Godfrey, Nikolaos Laoutaris, John Zahorjan, and the SIGCOMM reviewers for feedback on earlier drafts of the paper.

10. REFERENCES

[1] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
[2] D. Applegate and M. Thorup. Load optimal MPLS routing with N+M labels. In INFOCOM, 2003.
[3] D. Awduche, L. Berger, D. Gan, T. Li, V. Srinivasan, and G. Swallow. RSVP-TE: Extensions to RSVP for LSP tunnels. RFC 3209, 2001.
[4] D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, and J. McManus. Requirements for traffic engineering over MPLS. RFC 2702, 1999.
[5] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Towards predictable datacenter networks. In SIGCOMM, 2011.
[6] Y. Chen, S. Jain, V. K. Adhikari, Z.-L. Zhang, and K. Xu. A first look at inter-data center traffic characteristics via Yahoo! datasets. In INFOCOM, 2011.
[7] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with Orchestra. In SIGCOMM, 2011.
[8] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee. DevoFlow: Scaling flow management for high-performance networks. In SIGCOMM, 2011.
[9] E. Danna, S. Mandal, and A. Singh. A practical algorithm for balancing the max-min fairness and throughput objectives in traffic engineering. In INFOCOM, 2012.
[10] A. Elwalid, C. Jin, S. Low, and I. Widjaja. MATE: MPLS adaptive traffic engineering. In INFOCOM, 2001.
[11] Project Floodlight. http://www.projectfloodlight.org/.
[12] B. Fortz, J. Rexford, and M. Thorup. Traffic engineering with traditional IP routing protocols. IEEE Comm. Mag., 2002.
[13] T. Hartman, A. Hassidim, H. Kaplan, D. Raz, and M. Segalov. How to split a flow? In INFOCOM, 2012.
[14] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer. Achieving high utilization with software-driven WAN (extended version). Microsoft Research Technical Report 2013-54, 2013.
[15] S. Jain et al. B4: Experience with a globally-deployed software defined WAN. In SIGCOMM, 2013.
[16] V. Jeyakumar, M. Alizadeh, D. Mazieres, B. Prabhakar, and C. Kim. EyeQ: Practical network performance isolation for the multi-tenant cloud. In HotCloud, 2012.
[17] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the tightrope: Responsive yet stable traffic engineering. In SIGCOMM, 2005.
[18] S. Kandula, D. Katabi, S. Sinha, and A. Berger. Dynamic load balancing without packet reordering. SIGCOMM CCR, 2007.
[19] N. Kushman, S. Kandula, D. Katabi, and B. M. Maggs. R-BGP: Staying connected in a connected world. In NSDI, 2007.
[20] C. Labovitz, S. Iekel-Johnson, D. McPherson, J. Oberheide, and F. Jahanian. Internet inter-domain traffic. SIGCOMM Comput. Commun. Rev., 2010.
[21] N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez. Inter-datacenter bulk transfers with NetStitcher. In SIGCOMM, 2011.
[22] A. Mahimkar, A. Chiu, R. Doverspike, M. D. Feuer, P. Magill, E. Mavrogiorgis, J. Pastor, S. L. Woodward, and J. Yates. Bandwidth on demand for inter-data center communication. In HotNets, 2011.
[23] R. McGeer. A safe, efficient update protocol for OpenFlow networks. In HotSDN, 2012.
[24] M. Meyer and J. Vasseur. MPLS traffic engineering soft preemption. RFC 5712, 2010.
[25] V. S. Mirrokni, M. Thottan, H. Uzunalioglu, and S. Paul. A simple polynomial time framework for reduced-path decomposition in multi-path routing. In INFOCOM, 2004.
[26] D. Nace, N.-L. Doan, E. Gourdin, and B. Liau. Computing optimal max-min fair resource allocation for elastic flows. IEEE/ACM Trans. Netw., 2006.
[27] A. Pathak, M. Zhang, Y. C. Hu, R. Mahajan, and D. Maltz. Latency inflation with MPLS-based traffic engineering. In IMC, 2011.
[28] L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica. FairCloud: Sharing the network in cloud computing. In SIGCOMM, 2012.
[29] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker. Abstractions for network update. In SIGCOMM, 2012.
[30] M. Roughan, A. Greenberg, C. Kalmanek, M. Rumsewicz, J. Yates, and Y. Zhang. Experience in measuring backbone traffic variability: Models, metrics, measurements and meaning. In Internet Measurement Workshop, 2002.
[31] A. Shieh, S. Kandula, A. Greenberg, C. Kim, and B. Saha. Sharing the data center network. In NSDI, 2011.
[32] S. Traverso, K. Huguenin, I. Trestian, V. Erramilli, N. Laoutaris, and K. Papagiannaki. TailGate: Handling long-tail content with a little help from friends. In WWW, 2012.
[33] Broadcom Trident II series. http://www.broadcom.com/docs/features/StrataXGS_Trident_II_presentation.pdf, 2012.
[34] L. Vanbever, S. Vissicchio, C. Pelsser, P. Francois, and O. Bonaventure. Seamless network-wide IGP migrations. In SIGCOMM, 2011.
[35] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron. Better never than late: Meeting deadlines in datacenter networks. In SIGCOMM, 2011.
[36] M. Zhang, B. Karp, S. Floyd, and L. Peterson. RR-TCP: A reordering-robust TCP with DSACK. In ICNP, 2003.