LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the...

14
LoCo: Localizing Congestion Vishal Shrivastav Cornell University Saksham Agarwal Cornell University Rachit Agarwal Cornell University Hakim Weatherspoon Cornell University Abstract Datacenter congestion control remains to be a hard problem. The challenge is that while congestion control is usually im- plemented at the end-hosts, congestion itself can happen at network switches. Thus, the ability of end-hosts to react to congestion is fundamentally limited by the timeliness and pre- cision of congestion signals from the network. Unfortunately, despite decades of research, the community is still in quest of such timely and precise congestion signals. LoCo takes a new approach to resolving the congestion control problem: it localizes the congestion to egress queues of the end-hosts, precisely where congestion control is im- plemented, while bounding both the worst-case queuing at each switch and the network utilization. LoCo achieves this by exploiting the structure in datacenter network topologies to perform network-wide admission control for each and ev- ery packet in the network. LoCo is a clean-slate design that requires changes in network switches as well as end-hosts NICs. We evaluate LoCo using both an end-to-end hardware implementation and large-scale simulations; our results show that LoCo not only achieves the above mentioned theoretical guarantees, but also achieves better performance than several state-of-the-art congestion control protocols over standard datacenter workloads. 1 Introduction Datacenter congestion control remains to be a hard problem. The core challenge that makes the problem hard is that while congestion control is usually implemented at the end-hosts, congestion itself can happen at both the end-hosts and at the network switches (Figure 1); as a result, end-hosts have to rely on congestion signals from the switching fabric (e.g., delay, packet drops, ECN, etc.) to “infer” the state of con- gestion, and to react to congestion. Thus the performance of congestion control protocols relying on congestion signals is fundamentally limited by the timeliness and precision of congestion signals (§2, §6). In this paper, we present a new datacenter network design, called LoCo, that takes a fundamentally new approach to re- solving the congestion control problem: rather than exploring the design of the right congestion signal, it directly resolves the root of the problem — LoCo ensures that the network switches are never congested, thus obviating the need for a congestion signal altogether, and localizes the congestion to the egress queues of the end-hosts (Figure 2). LoCo thus provides each end-host a local, real-time, access to all the congestion state needed to react to congestion (§3.1.3). SWITCHES egress queues congestion points scheduler END-HOST Tx APP congestion signal Tx APP Tx APP congestion control buers Figure 1: Congestion control in today’s datacenters SWITCHES egress queues congestion point scheduler END-HOST admission control flow start/end events “pull” request Tx APP Tx APP Tx APP congestion control buers Figure 2: Congestion control in LoCo More specifically, LoCo theoretically guarantees bounded queuing at each network switch while providing worst-case bounds on network utilization for arbitrary workloads (under standard assumptions, §3.2). 
The key insight in LoCo is that the inherent structure in multi-tier multi-rooted tree topolo- gies, the most common datacenter network topology, e.g., FatTree [4] and VL2 [16], enables performing a distributed and fully asynchronous network-wide admission control that proactively avoids congestion at the switches (§3): network switches, without any explicit coordination or synchroniza- tion, orchestrate the admission and scheduling of each and every packet in the network to achieve the above properties. Further, LoCo’s design and implementation not merely works for full bisection bandwidth topologies (§3.3) but also gener- alizes to oversubscribed topologies (§3.4). LoCo is a clean-slate design, requiring changes in network switches and end-host NICs. To demonstrate the feasibility of the design, we implement custom NICs and switches on FPGAs (§4), and use them to build a small-scale testbed (§5). Our evaluation over the testbed and large-scale simulations using a packet-level simulator demonstrate that LoCo not only provides worst-case bounds on both the queuing at each switch and the network utilization, but also achieves better performance than several state-of-the-art congestion control protocols over standard datacenter workloads (§5, §6). LoCo’s design is currently limited to multi-tier multi- rooted network topologies. Generalizing LoCo to unstruc- tured topologies is an interesting future direction. 1

Transcript of LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the...

Page 1: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

LoCo: Localizing Congestion

Vishal ShrivastavCornell University

Saksham AgarwalCornell University

Rachit AgarwalCornell University

Hakim WeatherspoonCornell University

AbstractDatacenter congestion control remains to be a hard problem.The challenge is that while congestion control is usually im-plemented at the end-hosts, congestion itself can happen atnetwork switches. Thus, the ability of end-hosts to react tocongestion is fundamentally limited by the timeliness and pre-cision of congestion signals from the network. Unfortunately,despite decades of research, the community is still in quest ofsuch timely and precise congestion signals.

LoCo takes a new approach to resolving the congestioncontrol problem: it localizes the congestion to egress queuesof the end-hosts, precisely where congestion control is im-plemented, while bounding both the worst-case queuing ateach switch and the network utilization. LoCo achieves thisby exploiting the structure in datacenter network topologiesto perform network-wide admission control for each and ev-ery packet in the network. LoCo is a clean-slate design thatrequires changes in network switches as well as end-hostsNICs. We evaluate LoCo using both an end-to-end hardwareimplementation and large-scale simulations; our results showthat LoCo not only achieves the above mentioned theoreticalguarantees, but also achieves better performance than severalstate-of-the-art congestion control protocols over standarddatacenter workloads.

1 IntroductionDatacenter congestion control remains to be a hard problem.The core challenge that makes the problem hard is that whilecongestion control is usually implemented at the end-hosts,congestion itself can happen at both the end-hosts and at thenetwork switches (Figure 1); as a result, end-hosts have torely on congestion signals from the switching fabric (e.g.,delay, packet drops, ECN, etc.) to “infer” the state of con-gestion, and to react to congestion. Thus the performance ofcongestion control protocols relying on congestion signalsis fundamentally limited by the timeliness and precision ofcongestion signals (§2, §6).

In this paper, we present a new datacenter network design,called LoCo, that takes a fundamentally new approach to re-solving the congestion control problem: rather than exploringthe design of the right congestion signal, it directly resolvesthe root of the problem — LoCo ensures that the networkswitches are never congested, thus obviating the need for acongestion signal altogether, and localizes the congestion tothe egress queues of the end-hosts (Figure 2). LoCo thusprovides each end-host a local, real-time, access to all thecongestion state needed to react to congestion (§3.1.3).

SWITCHES

egressqueues

congestion points

scheduler

END-HOST

TxAPP

congestion signal

TxAPP

TxAPP

cong

estio

n co

ntro

l

buffers

Figure 1: Congestion control in today’s datacenters

SWITCHES

egressqueues

congestion point

scheduler

END-HOST

admission control

flow start/endevents

“pull”request

TxAPP

TxAPP

TxAPP

cong

estio

n co

ntro

l

buffers

Figure 2: Congestion control in LoCo

More specifically, LoCo theoretically guarantees boundedqueuing at each network switch while providing worst-casebounds on network utilization for arbitrary workloads (understandard assumptions, §3.2). The key insight in LoCo is thatthe inherent structure in multi-tier multi-rooted tree topolo-gies, the most common datacenter network topology, e.g.,FatTree [4] and VL2 [16], enables performing a distributedand fully asynchronous network-wide admission control thatproactively avoids congestion at the switches (§3): networkswitches, without any explicit coordination or synchroniza-tion, orchestrate the admission and scheduling of each andevery packet in the network to achieve the above properties.Further, LoCo’s design and implementation not merely worksfor full bisection bandwidth topologies (§3.3) but also gener-alizes to oversubscribed topologies (§3.4).

LoCo is a clean-slate design, requiring changes in networkswitches and end-host NICs. To demonstrate the feasibilityof the design, we implement custom NICs and switches onFPGAs (§4), and use them to build a small-scale testbed (§5).Our evaluation over the testbed and large-scale simulationsusing a packet-level simulator demonstrate that LoCo notonly provides worst-case bounds on both the queuing at eachswitch and the network utilization, but also achieves betterperformance than several state-of-the-art congestion controlprotocols over standard datacenter workloads (§5, §6).

LoCo’s design is currently limited to multi-tier multi-rooted network topologies. Generalizing LoCo to unstruc-tured topologies is an interesting future direction.

1

Page 2: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

(a) ECN (DCTCP) (b) ECN+PFC (DCQCN) (c) Drops (pFabric) (d) Drops+Cut-Payload (NDP)

Figure 3: State-of-the-art datacenter congestion control protocols using various congestion signals can observe arbitrarily lowthroughput under specific workloads—DCTCP under incast workload, DCQCN and pFabric under permutation alongside incastworkload, and NDP under all-to-all workload, on a 144-node full bisection bandwidth 2-tier topology. Experiment details in §6.2.

2 Motivation and InsightTraditional congestion control protocols, such as TCP [21],are designed based on the end-to-end principle, where the con-gestion control functionality is implemented entirely at theend-hosts. This enables a highly scalable plug-and-play de-sign, which could evolve independently from the core switch-ing fabric. However, an upshot of such a design is that thecongestion control algorithm at the end-hosts has no local,real-time access to the state of congestion in the switchingfabric, and has to rely on some sort of a congestion signalfrom the fabric to infer and react to congestion. The challengeis that congestion signals have two fundamental limitations:Timeliness of congestion signals. There is a delay betweenswitches observing congestion and the congestion signalreaching back the respective end-hosts; for many protocolsthis delay could be as long as a single round-trip time. Manytechniques have been proposed to reduce this delay [2,11,35]but they are fundamentally limited by speed-of-light latencybetween the switch and the end-host. If the traffic changes atfiner timescales, end-hosts will be fundamentally limited intheir ability to react to congestion.Precision of congestion signals. Congestion signals haveto encode the extent and reason for congestion at networkswitches using just a few bits. This fundamentally limitstheir precision, and enforces convergence over multiple it-erations [5, 21, 23, 34]. Unfortunately, the timescale for con-vergence is much coarser-grained than timescales at whichtraffic patterns change in today’s datacenters [12, 29], thuslimiting the effectiveness of congestion control.Next, while an end-to-end congestion control makes sense foran Internet-like network, in the context of a datacenter, wherethe scale is much smaller, topologies are typically structured,and it is relatively easier to customize the networking hard-ware, prior works [5, 6, 14, 18, 19, 26] have made a case tomove away from an end-to-end design and involve both theend-hosts and the network switches in the process of con-gestion control. However, most of these prior works use theabove insight to either improve the timeliness [11, 35] or pre-cision [5, 14, 26] of congestion signals, or reduce the reliance

Congestion signals Protocols Protocol classDelay + Packet Drop TCP [21] Sender-driven

ECN DCTCP [5]Packet Drop + Cut Payload NDP [18] Receiver-driven

Packet Drop pFabric [6] In-networkscheduling

ECN + PFC DCQCN [36] Protocols overDelay + PFC TIMELY [28] lossless fabric

Table 1: Various congestion signals used in datacenter conges-tion control protocols. Using one protocol from each repre-sentative congestion signal and protocol class, Figure 3 showsthat lack of timeliness and precision in congestion signals canresult in arbitrarily low throughput under these protocols.

on congestion signals [6, 19], but fail to completely obviatethe need for a congestion signal. Hence, the performance ofthese protocols is still fundamentally limited by the timelinessand precision of congestion signals. An unfortunate conse-quence is that, for many of these congestion signals (Table 1),the corresponding state-of-the-art congestion control proto-col can result in degraded network performance (§6), or inthe worst-case, arbitrarily low network throughput (even fortraffic patterns fairly common within datacenters; Figure 3).

LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by carefullydividing the congestion control functionality across the end-hosts and the network switches, LoCo altogether obviates thevery need for a congestion signal. More specifically, LoCoexploits the inherent structure in datacenter network topolo-gies to run a distributed network-wide admission control inthe network switches that proactively avoids congestion atthe switches. Thus, LoCo localizes the congestion to end-host egress queues, where it is handled by a local congestioncontrol algorithm, which now has local access to all the stateneeded for congestion control. Through this division of labor,LoCo not only achieves worst-case bounds on both the queu-ing at each switch and the network utilization (§3.2), but alsoachieves high performance for standard datacenter workloads(§6).

2

Page 3: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

0

11 0

10

00

0 110

1 0

10

1 2 3 41234

Destination

Sour

ce

0

10 0

00

00

0 010

0 0

00

1 2 3 41234

Destination

Sour

ce

0

10 0

00

00

0 010

0 0

00

1 2 3 41234

Destination

Sour

ce

0

01 0

00

00

0 100

1 0

10

1 2 3 41234

Destination

Sour

ce

0

11 0

10

00

0 110

1 0

10

1 2 3 41234

Destination

Sour

ce

0

10 0

00

00

0 010

0 0

00

1 2 3 41234

Destination

Sour

ce

1

2

3

4

1

2

3

4

1

2

3

4

1

2

3

4

arbitrary traffic matrix T permutation matrix P

(a) (b) (c)

Matching

T[s][d] = 1 means active flow (s d)

0

00 0

00

00

0 010

0 0

00

1 2 3 41234

Destination

Sour

ce

0

01 0

00

00

0 100

0 0

10

1 2 3 41234

Destination

Sour

ce

Figure 4: Flow scheduling for admission control in LoCo. (a) Converting an arbitrary traffic matrix into a permutation trafficmatrix. (b) Different permutation matrices for the traffic matrix in (a). (c) Mapping the problem of converting an arbitrary trafficmatrix into a permutation matrix to the problem of matching in bipartite graphs.

3 LoCoIn this section, we describe the design of LoCo. LoCo’s designworks for both virtual input and virtual output switch queuemodels, but to keep the description in this section precise,we assume that network switches are virtual output queued(VOQ), where each egress port comprises two sets of virtualqueues, one for data and the other for control packets.

We first describe the core ideas behind LoCo’s design (§3.1)and state the key results (§3.2), followed by a detailed descrip-tion of the design (§3.3 and §3.4). Finally, we address someof the interesting practical challenges around LoCo’s design(§3.5, §3.6 and §3.7).

3.1 LoCo Core IdeasLoCo exploits the structure in multi-tier multi-rooted datacen-ter network topologies to perform network-wide admissioncontrol: LoCo switches, without any explicit coordination,orchestrate the admission and scheduling of each and everypacket in the network in a manner that congestion can belocalized to end-host egress queues.

In this subsection, we describe the core ideas in LoCousing a single switch case, and later generalize it to multi-tiermulti-rooted datacenter networks.

3.1.1 Intuition: LoCo for a single switch network

Consider a single LoCo switch connected to a number of end-hosts. The high-level operation of LoCo admission controlprotocol is as follows:

1. Each individual end-host, upon arrival or departure of aflow, generates a flow notification packet; the notificationincludes the flow’s source and destination, and stateswhether the flow has arrived at the source or has finished.A flow notification packet may contain arrival/departureinformation about multiple flows. LoCo enforces flownotification packets to be of same size.

2. Let the number of end-hosts be N. For the single switchcase, LoCo switch maintains a N×N binary matrix, with

the (i, j) cell entry set to 1 if source on port i has out-standing packets to destination on port j; otherwise, thecell entry is set to 0. The traffic matrix is updated onthe receipt of every flow notification packet from theend-hosts, and may have an arbitrary structure — forall-to-all traffic, all entries will be set to 1 and for per-mutation traffic, each row and each column will haveat most one entry set to 1, etc. An example is shown inFigure 4a.

3. Let us assume fixed size data packets of size p (§4.3describes how we choose the value of p). Let t(p,B) bethe total transmission time for a p-sized data packet anda flow notification packet over a link of bandwidth B.Every t(p,B) time units, LoCo switch converts the cur-rent traffic matrix to a (not necessarily full) permutationtraffic matrix, as shown in Figure 4a.

4. The switch then sends each source end-host in the per-mutation traffic matrix a PULL request. The end-host(s),upon receiving the request, respond immediately by send-ing a p-sized data packet for the destination specifiedin the request, along with at most one flow notificationpacket.

Thus the key aspect of LoCo’s design is the end-hosts donot voluntarily release packets into the switching fabric, butrather the LoCo switch controls when and which packets toadmit into the switching fabric, using the PULL packets. Thisinsight will be key to proving many of the LoCo’s bounds.

LoCo properties for a single switch network. Any givenswitch in a single switch network would support zero-queuedata transfers as long as the following property holds: eachend-host sends to and receives from a single end-host. LoCoswitch, by reducing the arbitrary traffic matrix to a permuta-tion matrix, achieves precisely this property — by admittinga permutation traffic, LoCo switch incurs zero queuing at allits outgoing ports.

The other property of LoCo is related to the reductionitself. Indeed, there may be multiple permutation matrices

3

Page 4: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

corresponding to the same traffic matrix (Figure 4b), eachof which may achieve a different network utilization whilestill guaranteeing zero queuing at LoCo switch. For instance,in n = 4-node network in Figure 4, the utilization may varyfrom a factor 1/n to 1 of the optimal utilization, dependingon which permutation matrix from Figure 4b is chosen. For-tunately, the problem of reducing an arbitrary traffic matrixto a permutation matrix can be reduced to the well-knownproblem of bipartite matching [27], where end-hosts are thevertices, and flows between the end-hosts are the edges inthe bipartite graph (Figure 4c). The network utilization thenequals the size of the matching; for example, so called maxi-mum matching corresponds to the optimal utilization for thegiven traffic matrix [27].

Unfortunately, all the known algorithms for maximummatching are computationally expensive [7, 27]. LoCoswitches, hence, rely on the classical greedy algorithm foronline matchings [22], that achieves maximal matching, whileguaranteeing the output matching size within a factor of 2× ofthe maximum matching size. The algorithm itself is extremelysimple, adding an edge to the matching if both the source andthe destination end-hosts are unmatched, and hence by usingthis algorithm, LoCo is not only able to keep its switch designsimple but is also able to guarantee that its network utilizationis at least 50% of the optimal utilization for that traffic matrix.This is the best guarantee any matching algorithm can providefor LoCo’s model (flows arriving over time relates to so-callededge-arrival model in online bipartite matching) [10].

3.1.2 LoCo for Multi-tier Multi-rooted topologies

The core technique used in generalizing LoCo from the caseof a single switch network to multi-tier multi-rooted treetopologies is to exploit the structure in the topology itself— We recursively decompose a multi-tier multi-rooted topol-ogy into a collection of logical single switch networks, each ofwhich independently orchestrates the admission and schedul-ing of packets using the idea described for the single switchcase. §3.3.1, §3.3.2, and §3.4 describe this design in detail.

LoCo’s design is fully asynchronous, and does not requireany explicit coordination between the various componentsin the decomposed topology, nor assumes any time synchro-nization between network switches and end-hosts; each LoCoswitch within each logically decomposed network operatescompletely independently in making decisions every t(p,B)time units and end-hosts simply react to the PULL requestsfrom the switches.

3.1.3 LoCo End-host

End-hosts in LoCo do not release packets into the switchingfabric voluntarily, but rather release packets in response tothe PULL packets received from the switches. The high-levelarchitecture of a LoCo end-host is shown in Figure 2. Eachend-host maintains a set of egress queues, typically one perdestination, and packets from applications are mapped into

the appropriate egress queue. Packets from the egress queuesare released into the switching fabric via a scheduler, whichparses the PULL packets and then sends out the packet fromappropriate egress queue as requested in the PULL packet.Thus, the drain rate of the egress queues is governed by thePULL packets, which, in turn, is governed by LoCo’s admis-sion control algorithm. This could result in a mis-match be-tween the rate packets are being generated by the applicationsand the drain rate of the egress queues, resulting in congestionat the egress queues. However, since this congestion is localto each end-host, the congestion control algorithm at the end-host can react very effectively to the congestion by simplyapplying the appropriate back-pressure to the applications.Thus, LoCo greatly reduces the complexity of congestion con-trol algorithm at each end-host, by pushing the complexity ofmanaging congestion in the switching fabric to the admissioncontrol algorithm running at the network switches.

3.2 LoCo ResultsLoCo guarantees that congestion is localized to end-hostegress queues, while providing worst-case bounds on net-work utilization. In this subsection, we first outline the set ofassumptions under which LoCo’s theoretical guarantees holdand then formally state the guarantees.

Assumptions for theoretical guarantees. LoCo’s theoreti-cal guarantees hold under the following assumptions:

• Multi-tier multi-rooted datacenter network topologies,e.g., FatTree [4] and VL2 [16].

• All data packets are of the same size (as are all controlpackets). §4.3 describes how to configure this value.

• No hardware failures. §3.5 describes handling failures.• No control packet loss. §3.6 describes handling losses.• All links within the same tier of the topology have the

same propagation delay; this assumption can be relaxedwith slightly worse bounds on queuing (§3.7).

• Large-sized flows in multiples of data packet size. Thisassumption is only needed for the utilization bound, and§6.3 discusses the implications when it does not hold.

• Network built out of identical k−ary switches with iden-tical per port bandwidth, e.g., FatTree [4]. This assump-tion is purely for the ease of analysis, as LoCo’s designwould decompose a switch port with higher bandwidthinto multiple logical ports with smaller bandwidth, andsimilarly decompose a switch with higher port count intomultiple logical switches with smaller port count, untilwe have a network with identical logical components.

While LoCo needs these assumption to provide its theoret-ical guarantees, it does not enforce them in its design andimplementation — we will show in §5 and §6 that, over theevaluated workloads (which do not meet many of the aboveassumptions), LoCo performance is still better than severalstate-of-the-art congestion control protocols. Finally, we notethat even with the above assumptions, none of the existing dat-acenter network designs provide guarantees similar to LoCo.

4

Page 5: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

Theorem 1 (Queue bound) Under the above mentioned as-sumptions, LoCo guarantees that, for any arbitrary workload,the data queues at each network switch port is bounded by kdata packets, where k is the number of switch ports.Theorem 2 (Network utilization bound) Under the abovementioned assumptions and ignoring the overhead of controlpackets and assuming full bisection bandwidth, LoCo guaran-tees that, for any arbitrary workload, the network utilization isat least within 50% of the theoretically achievable utilizationfor that workload. This bound is tight.The full bisection bandwidth assumption is necessary forproviding network utilization bound; providing worst-caseutilization bounds for oversubscribed topologies is an openproblem [3]. Here we also note that several prior works thatprovide some form of performance guarantees also assumefull bisection bandwidth [3, 6, 30]. Queue bounds in LoCo,however, hold even for an oversubscribed topology (§3.4).

3.3 LoCo for Multi-tier Multi-rooted topolo-gies with Full Bisection Bandwidth

In this section, we describe LoCo’s design for multi-tier multi-rooted topologies, assuming full bisection bandwidth.

3.3.1 LoCo for 2-tier Multi-rooted Topologies

We start with 2-tier multi-rooted topologies with full bisectionbandwidth, built using k−ary switches (Figure 5a). Such atopology comprises k2

/2 end-hosts, k/2 tier-2 switches, k tier-1 switches. Our design is based on two key ideas.

First, as shown in Figure 5a-b, LoCo logically decomposesthe topology into k/2 single switch networks. Each of thedecomposed virtual network operates like a single switchas follows. Each of the tier-1 switches simply act as a relayswitch (forwarding packets between the end-hosts and the cor-responding tier-2 switches, with no additional functionality);thus, each tier-2 switch is directly connected to each of theend-hosts over a virtual bandwidth of B/(k/2) (Figure 5c).

Second, recall from §3.1.1 the LoCo admission controlprotocol for a single switch. Next, we generalize the protocolto 2-tier topologies. After the 2-tier topology is decomposedinto k/2 single switches, each switch has k2

/2 virtual ports,each connected to one of the end-hosts with a bandwidthof B/(k/2). Thus the memory requirements at each switchare now k4

/4 bits, which are significantly smaller than whattoday’s switches support (e.g., for k = 64, the memory require-ments are around 524KB). Moreover, since the bandwidth ineach of the virtual switches has reduced by a factor of k/2,each tier-2 switch will now generate a PULL request a factorof k/2 slower than the full-bandwidth single switch case; end-hosts still send one p-sized packet per pull request. Thus, thefollowing two invariants are maintained:Invariant 1 Each tier-2 switch can issue at most one pullrequest per t(p,B/(k/2)) time to any given source.Invariant 2 Each tier-2 switch can issue at most one pullrequest per t(p,B/(k/2)) time for any given destination.

L0 L1 L2 L3

B

B

tier-2

tier-1

E00 E01 E10 E11 E20 E21 E30 E31

E00 E01 E10 E11 E20 E21 E30 E31

B/2

E00 E01 E10 E11 E20 E21 E30 E31

B/2

(a) (b)

(c)

L0 L1 L2 L3

E00 E01 E10 E11 E20 E21 E30 E31

S1S1B/2

B/2

S1

S0S0

S0

Figure 5: LoCo logically decomposes a 2-tier topology intok/2 single switches, each of which orchestrates the admissionand scheduling of packets completely independently fromother switches. In this way, LoCo is able to use the samedesign as for a single switch (§3.1) for a 2-tier topology withthe queue bound increasing from 0 to k. Details in §3.3.1.

Algorithm 1 LoCo Admission Control Algorithm[for a full bisection bandwidth network]

1: T[ ][ ]: current traffic matrix2: procedure ADMITPACKETS

3: for every t(p,B/(k/2))-transmission time do4: Mark all destinations unmatched5: Randomize the virtual port list ▷ # virtual ports = k2

/26: for each virtual port i in the list do7: Select a random unmatched dest d with T[i][d] = 18: Mark d as matched9: Send a pull req on virtual port i for one packet for d

For routing of data packets, LoCo uses a simple mechanism:pull requests specify the destination identifier as well as thetier-2 switch identifier that issues the pull request; data packetsare forwarded using tier-2 switch identifier on the upstreampath, and using the destination identifier on the downstreampath. If the destination is within the same tier-1 switch, thenthe data packet is routed directly to the destination. Thisrouting mechanism along with the above invariants turn outto be sufficient to prove our bounds.

Each tier-2 switch, without any explicit coordination withother tier-2 switches, orchestrates the admission and thescheduling of each and every packet in LoCo over the samephysical topology; thus, the PULL requests from differentswitches and the data response from each end-host, while care-fully orchestrated over the decomposed virtual topology, maystill overlap on the physical topology, thus precluding zero-queuing as in a single switch network. However, note thatt(p,B/(k/2)) is the transmission time for a p-sized packetover a link of bandwidth B/(k/2). Since each tier-2 switchhas a dedicated virtual bandwidth of B/(k/2) to each end-host, the rate at which pull requests are generated and the datapackets are sent by any end-host over each of the decomposed

5

Page 6: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

virtual topologies exactly matches the access link bandwidthfor each end-host. Thus, LoCo simultaneously avoids persis-tent queue build-up at the switches (bounded queuing), whilealso fully utilizing the end-host access link bandwidth.

Algorithm 1 shows one approach to implement LoCo’sadmission control mechanism, using which LoCo achievesthe guarantees mentioned in §3.2 (proof in Appendix A).

3.3.2 LoCo for 3-tier Multi-rooted Topologies

We discuss LoCo design for 3-tier multi-rooted topologies. Tokeep the discussion concise, we focus specifically on a k-aryFatTree [4] comprising k3

/4 end-hosts, k2/2 tier-1 switches,

k2/2 tier-2 switches, and k2

/4 tier-3 switches (Figure 6a).The core challenge in generalizing LoCo to 3-tier topolo-

gies is that maintaining the traffic matrix at per-host gran-ularity at the tier-3 switches now becomes prohibitive. In-deed, since a 3-tier multi-rooted tree topology may have asmany as k3

/4 end-hosts, the size of such a matrix would be(k3/4)2 = k6

/16 bits, which is prohibitive even for small val-ues of k, e.g., for k = 64, storing the matrix would require0.54GB of memory at each tier-3 switch.

To overcome the above challenge, LoCo maintains the traf-fic matrix at the granularity of tier-1 switches. Specifically,suppose the tier-1 switches are numbered from 1 to k2

/2.Then, each tier-3 switch now maintains a k2

/2×k2/2 traffic

matrix, where an entry (i, j) is set to 1 if there is a packet tobe sent from any end-host connected to tier-1 switch num-bered i to any end-host connected to tier-1 switch numberedj. This not only reduces the memory requirements at tier-3 switches significantly (e.g., for k = 64, each switch needsroughly 0.5MB of memory, well within practical limits), butalso allows LoCo to completely decompose the resultingtopology into k/2 “parallel” 2-tier multi-rooted tree topolo-gies that share absolutely no resources (see Figure 6b). Eachof these topologies can operate independently using LoCo’sdesign for 2-tier topologies from the previous subsection.

However, one challenge remains. Each tier-1 switch be-longs to k/2 parallel 2-tier networks, each of which indepen-dently performs the admission control using the traffic matrixat tier-1 switch granularity. And while this bounds the queuesizes for tier-3 and tier-2 switches similar to the queue boundvalues for tier-2 and tier-1 switches respectively in the 2-tiernetwork, it does not guarantee bounded queuing for queuesbetween tier-1 switches and the end-hosts. The reason beingthat since each tier-1 switch belongs to k/2 parallel 2-tiernetworks, and since all these parallel networks are operatingindependently, for certain traffic patterns we could end-upoverwhelming the access bandwidth of end-hosts within atier-1 switch by generating superfluous PULL requests. Onthe destination-side, consider an example of incast where k/2source end-hosts within different tier-1 switches have packetsto send to the same destination end-host within a specifictier-1 switch; since all the k/2 parallel networks are operatingindependently, each one of them would "pull" k/2 packets of

C0

E00E01

Btier-3

tier-2

tier-1

(a)

(b)

PARALLEL NETWORK 0

C1 C2 C3

S0 S1 S2 S3 S4 S5 S6 S7

L0 L1 L2 L3 L4 L5 L6 L7

E10E11 E20E21E30E31 E40E41E50E51E60E61E70E71

B

B

L0 L1 L2 L3 L4 L5 L6 L7 L0 L1 L2 L3 L4 L5 L6 L7

PARALLEL NETWORK 1

B B

(c)E00 E10 E20 E30 E40 E50 E60 E70 E01 E11 E21 E31 E41 E51 E61 E71

C0 C1

S0 S2 S4 S6

BC2 C3

S1 S3 S5 S7

B

PARALLEL NETWORK 0

L0 L1 L2 L3 L4 L5 L6 L7 L0 L1 L2 L3 L4 L5 L6 L7

PARALLEL NETWORK 1

B B

C0 C1

S0 S2 S4 S6

BC2 C3

S1 S3 S5 S7

B

Figure 6: (a) A FatTree network topology comprising k=4-port switches. (b) Converting a FatTree network into k/2 par-allel 2-tier networks. (c) Mapping each destination end-hostto a specific parallel 2-tier network, e.g., all packets destinedto E00 "pulled" from corresponding source tier-1 switchesexclusively within parallel network 0 (C0,C1).

size p each (one per tier-3 switch) for a total of k2/4 packets

for the destination over t(p,B/(k/2)) time according to Al-gorithm 1, thus resulting in queuing at the destination (whichcould only accept k/2 p-sized packets over t(p,B/(k/2)) timewithout overwhelming its bandwidth). A similar problem canalso happen at the source-side, where a tier-1 switch mayreceive superfluous PULL requests from the parallel networks,all of them to be forwarded to the same source end-host. Weuse a simple technique to resolve this issue.

Selective Notification Protocol. LoCo deterministicallymaps each destination to one of the k/2 logically decomposedparallel networks. For any destination d, the flow notificationcontrol packets for d are now forwarded by tier-1 switchesonly to the corresponding parallel network. Thus, exactly oneparallel network has the ability to admit packets for any givendestination. This resolves the destination-side problem.

More specifically, as illustrated in Figure 6c, the parallel2-tier networks are numbered from 0 to (k/2)− 1, and theend-hosts connected to tier-1 switch Li are numbered fromEi,0 to Ei,k/2−1. Next, a destination end-host Ei, j is mapped to

6

Page 7: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

L0 L1 L2 L3

B

B

tier-2

tier-1

E00 E01 E10 E11 E20 E21 E30 E31

E00 E01 E10 E11 E20 E21 E30 E31

B/2

(a) (b)

(c)

L0 L1 L2 L3

E00 E01 E10 E11 E20 E21 E30 E31

B/2

B

S0S0

S0 S0

E00 E01 E10 E11 E20 E21 E30 E31

BB

(d)

Figure 7: (a) A 2:1 oversubscribed 2-tier topology. (b-c) De-composing the oversubscribed topology according to §3.3.1.(d) Decomposing the oversubscribed topology that ensuresboth bounded queuing and high utilization, details in §3.4.

parallel network j, and the flow notification for destinationEi, j from any source is forwarded only on parallel network j.Thus, j is the only parallel network that admits packets forthis destination, resolving the destination-side problem.

To resolve the source-side problem, tier-1 switches ensurethat at most k/2 PULL requests are forwarded to any partic-ular source end-host per t(p,B/(k/2)) time, and the rest aredropped. The PULL requests to be forwarded can be selectedbased on a custom policy, e.g., in a fair queue manner toachieve fairness across multiple parallel networks.

Together, the design achieves the guarantees mentioned in§3.2 (proof in Appendix A).

3.4 Generalizing LoCo for OversubscribedMulti-tier Multi-rooted Topologies

In this section, we generalize LoCo’s design to oversubscribedmulti-tier multi-rooted topologies. In an oversubscribed topol-ogy, the core of the network supports less than full bisectionbandwidth, while the edge of the network still supports fullbisection bandwidth. We define the core of the network as thelinks between the highest tier and the tier immediately belowit, while the lower tiers are considered as the edge, and henceare equipped with full bisection bandwidth.

We focus on 2-tier multi-rooted oversubscribed topologies,which can then be used as the building block for 3-tier multi-rooted oversubscribed topologies via the decomposition tech-nique described in §3.3.2. Figure 7a shows an example 2-tieroversubscribed topology with an oversubscription ratio of2:1. A strawman design would decompose the 2-tier over-subscribed topology into a collection of virtual single switchnetworks in exactly the manner we did for a full bisectionbandwidth 2-tier topology, illustrated in Figure 7c. And whilethis design would provide bounded queuing, it would come atthe cost of link under-utilization, as unlike the full bisectionbandwidth topology (Figure 5), the access links of end-hosts

Algorithm 2 Generalized LoCo Admission Control Algorithm

1: T[ ][ ]: current traffic matrix2: Oversubscription ratio of X :13: procedure ADMITPACKETS

4: for every t(p,B/(k/2))-transmission time do5: Mark all destinations unmatched6: # pkt scheduled on each physical tier-2 switch port = 07: Randomize the virtual port list ▷ # virtual ports = k2

/28: for each virtual port i in the list do9: if # pkt scheduled on physical port corresp. to i

10: == k/2 then11: continue12: Select a random unmatched dest d with T[i][d] = 113: Mark d as matched14: Send a pull req on virtual port i for X packets for d

in the decomposed oversubscribed topology would only beable to operate at a maximum of B/X bandwidth (Figure 7c),where B is the access bandwidth and X is the oversubscrip-tion factor. Fundamentally, this is because according to Al-gorithm 1, each end-host sends/receives at most one p-sizedpacket per virtual single switch network during t(p,B/(k/2))time, and hence utilizes only B/(k/2) bandwidth per singleswitch network. In a full bisection bandwidth 2-tier topology,there are k/2 decomposed virtual single switch networks andhence the access links are fully utilized using Algorithm 1,but in a X :1 oversubscribed 2-tier topology, there are only(k/2)/X decomposed virtual single switch networks, thus re-sulting in link under-utilization. To overcome this issue, wedecompose the 2-tier oversubscribed topology according toFigure 7d, where each physical tier-2 switch port multiplexesbetween its k/2 virtual ports, and incorporate this design byextending Algorithm 1 using two key insights, presented as ageneralized LoCo admission control algorithm in Algorithm 2,which ensures both bounded queuing (same bounds as for afull bisection bandwidth network) and high link utilization—(i) First, whenever a source-destination pair is matched duringany t(p,B/(k/2)) time, we schedule X packets instead of one(line 14 in Algorithm 2), thus allowing for full utilizationof the access link. However, this could result in congestionat a tier-2 switch port if say all the k/2 virtual ports con-nected to it send X packets during t(p,B/(k/2)) time, sinceeach tier-2 port could only sustain at most k/2 packets pert(p,B/(k/2)) time without causing congestion. This is wherewe use the second insight, (ii) We explicitly keep track ofnumber of packets scheduled on each tier-2 switch port, andas soon as we have scheduled k/2 packets on a port duringa t(p,B/(k/2)) time, we stop scheduling on that port (lines9-11 in Algorithm 2; packets scheduled between end-hostswithin the same tier-1 switch are not counted, as that trafficis contained within the tier-1 switch and does not consumetier-2 switch port bandwidth). Note that in Algorithm 1, weimplicitly ensured that at most k/2 packets per tier-2 switchport are scheduled during any t(p,B/(k/2)) time, on account

7

Page 8: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

of scheduling one packet per pull request on each of the k/2virtual ports corresponding to each tier-2 switch port. Thus,for a full bisection bandwidth topology, i.e., oversubscriptionratio of 1:1, Algorithm 2 reduces to Algorithm 1.

3.5 Handling FailuresOne of the assumptions that LoCo makes to provide theoreti-cal guarantees is that no links or switches fail in the network.However, while LoCo loses its theoretical guarantees duringfailures, it uses a simple mechanism to detect, and to react tofailures in a timely manner. We now describe this mechanism.

Each node in the network transmits a “dummy” packet ifit does not have a data packet to transmit (at the end-host)or forward (at the switches). Next, we describe how thesedummy packets enable a simple failure detection mechanismin LoCo, and then discuss how LoCo adapts its admissioncontrol mechanism upon detecting a failure.Failure detection. LoCo enforcing each node to transmit con-tinuously (either real or dummy data) enables a very simplefailure detection mechanism — if a node does not receivedata from one of its neighbor for a certain time duration, it as-sumes the neighbor (or the link connecting the neighbor) hasfailed. This technique is inspired from the Ethernet’s physicallayer design, which continuously transmits idle bits betweenany two directly connected nodes if there is no data to trans-mit [13, 24, 25]. In this work, we simply extend this idea tolayer-2 for failure detection.Reacting to failures. Whenever a node detects the failure ofone of its neighbor, it stops forwarding the data packets to thatneighbor; in addition, all the PULL requests to be forwardedon that outgoing link are dropped. Since LoCo’s admissioncontrol runs at each switch without any explicit coordinationacross switches, LoCo does not require any updates in theadmission control logic upon a failure detection — PULLrequests may be generated but will eventually be dropped atthe egress port of the failed link.

3.6 Handling Packet LossesData packet queues at LoCo switches are bounded, but lossescan still happen due to failures. Further, control packet queuesat switches are not theoretically bounded and can also seetail drops, albeit extremely rarely as control packets are verysmall (§5.1) and generated in a controlled manner (§3.1.1).

LoCo’s design is fairly robust against control packetlosses—(i) PULL packet generation for a flow stops onlyonce the switch receives the flow end notification, and if thePULL packets are getting lost, it would simply result in thecorresponding switch generating more PULL packets untilthe corresponding flow finishes, and (ii) if a flow notificationpacket is lost, the corresponding source re-tries after a time-out, if: (1) it has not received a PULL packet from the switchduring that time (assumes the flow start notification was lost),or (2) keeps receiving PULL requests even after the flow hasfinished (assumes the flow end notification was lost).

To recover from data packet losses, LoCo relies on underly-ing transport protocol to implement loss recovery mechanism.

3.7 Handling Non-uniform DelaysThe queue bounds in LoCo assume uniform propagation de-lays within each tier of the network topology (but not acrosstiers). This requires uniform wire lengths within each tier,which is hard to ensure in practice. In general, the queuebounds would increase in case of non-uniform propagationdelays, and this increase would depend upon the maximumvariability in the wire lengths within each tier. For instance, ifthe propagation delays vary by at most a t(p,B), the queuebounds would increase by at most one. At 10Gbps link speedsand MTU-sized packets, t(p,B) = 1.2µs (120 ns at 100 Gbps).Given light travels at 5 ns/ m in optical wires, as long as no twowire lengths differ by more than 240 m (24 m at 100 Gbps),LoCo’s queue bounds in §3.2 would increase by at most one.

4 ImplementationIn this section, we describe the FPGA implementation ofLoCo NIC and switches for a k-ary FatTree [4]. We used Blue-spec System Verilog [9] for implementation (∼2500 LOCs).

Figure 8 shows the end-to-end implementation of LoCo.LoCo maintains two key datastructures: (i) a k/2×k3

/4 binarytraffic matrix at each tier-1 switch, storing all the active flowssourced at the end-hosts connected to the tier-1 switch, and(ii) a k2

/2× k2/2 binary traffic matrix at each tier-3 switch,

storing all the active flows between tier-1 switches.For the sake of brevity, we only describe in detail the im-

plementation of LoCo’s admission control algorithm (Algo-rithm 2), which sits at the core of LoCo’s design.

4.1 Implementing LoCo Admission ControlFlow notifications from end-hosts are converted into corre-sponding tier-1-layer flow notifications by the tier-1 switch,which then forwards them on the correct port to a tier-2 switch,according to the selective notification algorithm (§3.3.2). Tier-2 switch forwards each received flow notification on all itsports, so that all tier-3 switches receive it in parallel. Thus, ittakes RT T /4 for a flow notification to reach from an end-hostto all the tier-3 switches. Each tier-3 switch updates its corre-sponding traffic matrix on receipt of each flow notification.

Tier-3 switches store the traffic matrix in SRAM. We as-sume the SRAM comprises multiple dual-port blocks, withaccess latency of one clock cycle per block. This is supportedin both the FPGAs [20] and modern switches [31]. Each col-umn of the matrix is stored on a separate SRAM block, thusallowing us to access an entire row of the matrix in one clockcycle. Note that each row is indexed by virtual port (source).We use one port of SRAM for updating the traffic matrix onreceipt of a flow notification packet, and, in parallel, use theother port to access matrix entries as needed by Algorithm 2.

Every t(p,B/(k/2) time, each tier-3 switch runs lines 5-14of Algorithm 2 for each of its virtual port. A tier-3 switch with

8

Page 9: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

Generateflow

notification

Parsefromtier-1

switch

totier-1

switch

Network InterfaceCard (NIC)

Data Ctrl

Pars

e

10

001

110

1 0

0

0

(end-host)flowstart/end

(tier-1)flowstart/endSelective

notification

Pars

e

output ports

fromtier-2

switch

totier-2

switch

fromtier-3

switch

totier-3

switch

traffic matrix

10

001

110

1 0

0

0

AdmissionControl

Host

dest

inat

ion

id

flow

st

art/e

nd

DM

A

Tier-1 switch Tier-3 switch

flowstart/end

traffic matrixPa

rse

Pars

e

Pars

e

Pars

e

Pars

e

virtual Q

Data Ctrlvirtual Q

Data Ctrlvirtual Q

Data Ctrlvirtual Q

output ports

Data Ctrlvirtual Q

Data Ctrlvirtual Q

output ports

output ports

Data Ctrlvirtual Q

Data Ctrlvirtual Q

Data Ctrl

Pars

e

Tier-2 switch

Pars

e

virtual Q

Data Ctrlvirtual Q

output ports

egressqueues

Flow notification packets PULL packets Data packets

Figure 8: Implementation of LoCo’s NIC and switches for FatTree topology. Implementation of tier-2 and tier-1 switches for a2-tier multi-rooted topology is the same as the implementation of tier-3 and tier-2 switches respectively for FatTree topology.

k physical ports has k2/2 virtual ports, with k/2 virtual ports

per physical port. Hence, in every t(p,B) time, we select kvirtual ports for matching, one per physical port. To keep trackof the matched virtual ports and destinations, we maintain twobit-vectors, unmatchedVPort and unmatchedDestination,which are cleared after every t(p,B/(k/2)) time. And finally,to keep track of the number of packets scheduled on eachphysical switch port, we maintain a vector pktScheduled,which is also cleared after every t(p,B/(k/2)) time.

Our hardware implementation takes four clock cycles tomatch each virtual port (lines 5-14 in Algorithm 2), as de-scribed below. We assume an oversubscription ratio of X :1.

• Cycle-1: We select a random unmatched virtual portby randomizing the order of entries in the bit vectorunmatchedVPort and feeding it to a priority encoder.

• Cycle-2: We issue a read request for the row in trafficmatrix corresponding to virtual port i returned by thepriority encoder. We also set unmatchedVPort[i] to 0,and increment pktScheduled[phyPort(i)] by X .

• Cycle-3: The read response is a bit-vector (row) B ofsize number of destinations. We do a bit-wise AND(B & unmatchedDestination), randomize the order ofentries in the output and feed it to a priority encoder.

• Cycle-4: We select the destination d as output bythe priority encoder, and send a PULL packet request-ing X packets for destination d on virtual port i. Wealso set the entry unmatchedDestination[d] to 0. IfpktScheduled[phyPort(i)] equals k/2, we set allthe virtual ports corresponding to the physical portphyPort(i) in unmatchedVPort to 0.

Our implementation is fully pipelined, thus iterating over kvirtual ports per t(p,B) time takes k+1 clock cycles.

4.2 Resource ConsumptionWe implemented LoCo’s NIC and switches on Altera StratixV FPGA [20]. NIC, tier-1 switch, tier-2 switch, and tier-3

# switch ports # end-hostsSRAM consumption (in MB)tier-1 tier-2 tier-3

32 8 K 1.5 1.5 1.564 65 K 6.4 6.1 6.696 0.2 M 15.1 13.8 16.4

128 0.5 M 28.7 24.5 32.9

Table 2: Total memory requirement at LoCo switches, whichis at par with commodity switches with same port count [1].

switch consumed 13%, 14%, 13%, and 16% resp. of the avail-able Logic Modules (ALMs). The SRAM consumption at theNIC was 9 KB, while at each switch it was ∼30 KB.

In general, for a FatTree topology with k-port switchesand MTU-sized packets, a tier-3 switch would consumek4/32 Bytes for storing the LoCo state, and 1500∗ k2 Bytes

for storing data packets, a tier-2 switch would consume1500∗k2 Bytes for storing data packets, and a tier-1 switchwould consume k4

/64 Bytes for storing the LoCo state, and1500∗ k2 Bytes for storing data packets. Table 2 shows thememory requirement at the switches against network size.

4.3 LoCo Parameter: Data Packet SizeTo provide the theoretical guarantees, LoCo’s design assumesfixed size data packets of size p (defined in §3.1.1), and givena network topology and the link bandwidth, data packet sizep is the key parameter in LoCo’s admission control algorithm(Algorithm 2). The admission control algorithm is iterativeover the switch ports, and hence, to be able to admit data atline rate, LoCo requires the data packet sizes to be at leastas large as the time it takes for the switches to generate aPULL request on all of its k egress ports, i.e., k+ 1 cycles(§4.1). Assuming 64-port switches operating at 1 GHz [31], itwould take 65 ns to generate a PULL request on all ports. At10 Gbps link speed, this translates to minimum data packetsize of 82 B (820 B at 100 Gbps).

9

Page 10: LoCo: Localizing Congestion - Cornell Universityvishal/papers/loco_2019.pdf · LoCo takes the insight used in prior works to a new ex-treme, and resolves the very root of the problem—by

(a) Network utilization. (b) Queuing at switch ports.

Figure 9: [Testbed] Network utilization and maximum queu-ing in LoCo testbed for various workloads.

5 TestbedIn this section, we evaluate our FPGA-based implementationof LoCo switch and NIC through a 8-node testbed.

5.1 Testbed SetupOur testbed comprises eight Terasic DE5-Net boards [37],each with an Altera Stratix V FPGA [20] and four 10 Gbpsports. Two FPGAs are used to implement eight 10G NICs,one per port. The remaining FPGAs are used to implementsix switches (4 tier-1 and 2 tier-2 switches). The switches areconnected in a 2-tier multi-rooted topology with eight nodesand full bisection bandwidth. The NICs in our testbed are notconnected to the host applications. Instead, we implement apacket generator to emulate applications. We use the packetgenerator to generate custom workloads for testing (§5.2).LoCo parameters. The key parameter in LoCo’s design isthe data packet size (§4.3). Our FPGA implementation of theswitch runs at 156.25 MHz, and each switch has four ports,running at 10 Gbps each. Hence, it will take 32 ns to generatea PULL request on all of its four ports, which translates to theminimum data packet size of 40 B. We choose data packetsize of 1500 B, and use the minimum Ethernet packet size of64 B as the control packet size, to minimize control overhead.

5.2 Testbed ExperimentsTo evaluate our testbed, we use the packet generator to gener-ate three representative datacenter workloads—(i) Permuta-tion workload, where each end-host sends and receives exactlyone long running flow, (ii) Incast workload, where each end-host simultaneously starts a flow to the same destination, and(iii) All-to-All workload, where each end-host sends and re-ceives a long running flow from every other end-host. Werun each workload for 10 minutes. For each workload, wereport the network utilization (Figure 9a), which is the ratioof the network throughput achieved to the optimal networkthroughput for that workload, and maximum queuing at theswitch egress ports observed during the course of the exper-iment (Figure 9b). For all three workloads, LoCo achievesnear optimal utilization, without violating the queue bound offour data packets.

6 SimulationIn this section, we evaluate LoCo using packet-level simula-tions over large-scale experiments.

6.1 Simulation SetupWe use a 144-node 2-tier multi-rooted topology with 10 Gbpslink speed. We assume a propagation delay of 200 ns per hop.The default configuration assumes full bisection bandwidth.LoCo parameters. We use data packet size of 1500 B andcontrol packet size of 64 B, same as in the testbed.Baselines. We use four baselines to compare the performanceof LoCo, chosen from four key classes of datacenter con-gestion control protocols—(i) a sender-driven protocol inDCTCP [5], (ii) a receiver-driven protocol in NDP [18], (iii)a protocol with in-network prioritization in pFabric [6], and(iv) a protocol over lossless Ethernet in DCQCN [36]. We usethe simulator from [18] to run DCTCP, NDP and DCQCN,and the simulator from [15] to run pFabric.

6.2 Robustness EvaluationWe evaluate LoCo against a range of traffic patterns observedin datacenters, and report the network utilization achieved.Permutation traffic. (Figure 10a) Each end-host sends andreceives exactly one long-running flow. Each tier-2 switchin LoCo independently "pulls" packets from each flow, thusuniformly distributing packets from each flow across all thetier-2 switches, resulting in optimal bandwidth utilization.Incast traffic. (Figure 10b) We simultaneously start one smallflow (15 KB long) at each end-host, all destined to the samedestination, and keep repeating the workload once all theprevious set of flows have finished. Incast traffic can resultin synchronized packet drops at switches, which could leadto arbitrarily low network throughput in protocols such asDCTCP. LoCo’s admission control algorithm at each switch,however, matches the destination with exactly one source at atime, and hence avoids synchronized packet drops.All-to-All traffic. (Figure 10c) In this experiment, each end-host sends a 25 KB flow to every other end-host, and we keeprepeating the workload once all the previous set of flowshave finished. All-to-All workloads are challenging to handlefor receiver-driven protocols such as NDP, as each receiverin NDP has only a partial view of the traffic matrix, whichcan cause NDP to match receivers to senders sub-optimally,resulting in congestion and packet drops. In contrast, witha network-wide admission control, where each switch has aglobal view of the traffic matrix, LoCo is able to make muchbetter matching decisions, resulting in higher performance.Permutation alongside Incast. (Figure 10d) We start a longrunning permutation traffic between all the end-hosts withintier-1 switches A (senders) and B (receivers). Next, we startan incast traffic with smaller sized flows (compared to thepermutation traffic), with all the remaining end-hosts send-ing to one particular end-host in B. Under protocols such as


Figure 10: Network utilization against different traffic patterns commonly observed in datacenters: (a) Permutation traffic, (b) Incast traffic, (c) All-to-All traffic, (d) Permutation alongside Incast. Unlike baseline protocols, LoCo achieves high performance across all the evaluated traffic patterns.

Under protocols such as pFabric and DCQCN, the congestion due to the incast traffic causes collateral damage to the permutation traffic (drops due to SRPT [33] scheduling at the switches in pFabric, and pauses due to PFC in DCQCN), resulting in arbitrarily low network throughput. In contrast, with a network-wide admission control, LoCo is able to avoid such interference amongst competing traffic.
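For illustration, the traffic matrices used above can be constructed as in the following minimal Python sketch. The function names and the (src, dst, size) flow representation are our own, not the simulator's interface; permutation flows are long-running in the actual experiments, so the size argument there is only a placeholder.

    import random

    # Hosts are numbered 0..n-1; each entry is a (src, dst, flow_size_bytes) tuple.

    def permutation(n, size):
        # Every host sends to exactly one other host and receives from exactly one,
        # via a simple random cyclic shift (a derangement).
        shift = random.randrange(1, n)
        return [(s, (s + shift) % n, size) for s in range(n)]

    def incast(n, victim, size=15 * 1000):
        # All hosts except the victim simultaneously send one small flow to the victim.
        return [(s, victim, size) for s in range(n) if s != victim]

    def all_to_all(n, size=25 * 1000):
        # Every host sends one flow to every other host.
        return [(s, d, size) for s in range(n) for d in range(n) if s != d]

The "permutation alongside incast" pattern is then simply the union of a permutation over the hosts under the two chosen tier-1 switches and an incast from the remaining hosts to one victim.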

6.3 LoCo Bad Case Workloads

There are two key aspects of LoCo's design that could result in degraded performance for certain workloads.

First, LoCo's design assumes fixed-size packets, and provides a lower bound on the packet size needed to operate at line rate (§4.3). This lower bound is typically much smaller than the MTU, and we configure LoCo to operate with MTU-sized packets to minimize the control overhead. However, if a flow is smaller than an MTU, it results in wasted throughput. In the worst case, if all the flows in the workload were smaller than an MTU, LoCo would achieve low network utilization. We note, however, that such workloads are quite extreme; in most datacenter workloads, most of the bytes come from flows significantly larger than an MTU (§6.4). Nevertheless, optimizing a LoCo-like network design for these extreme workloads is an interesting future direction.
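For illustration, the wasted capacity for sub-MTU flows can be quantified with the following minimal sketch; the helper name slot_efficiency is ours, and it assumes each fixed-size packet slot is 1500 B as configured in §6.1.

    MTU = 1500  # LoCo's configured fixed packet size in bytes (Section 6.1)

    def slot_efficiency(flow_bytes, pkt_bytes=MTU):
        # Fraction of the last (or only) packet slot that carries useful payload.
        # A 300 B flow occupying a 1500 B slot uses only 20% of that slot.
        last = flow_bytes % pkt_bytes or pkt_bytes
        return last / pkt_bytes

    print(slot_efficiency(300))    # 0.2  -> 80% of the slot is wasted
    print(slot_efficiency(15000))  # 1.0  -> no waste for MTU-multiple flows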

Second, LoCo's design incurs an RTT/2 flow start-up overhead: RTT/4 to notify the highest-tier switches of flow start (§4.1), and another RTT/4 to receive the first PULL request, before transmitting any data. Note that this overhead is amortized across all the layer-4 flows active between the same source and destination end-hosts. The start-up overhead is the price we pay to proactively avoid congestion in the switching fabric and ensure small network queuing and zero packet drops. However, this could be a significant overhead for very small flows at very low load (with little to no congestion). The trade-off made in LoCo is justified as soon as the delays due to network queuing and packet drops start to dominate the end-to-end delay. We observe this in §6.4.
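As a rough, illustrative calculation (our own, using a hypothetical base RTT of a few microseconds rather than a value from the paper), the following sketch compares the fixed RTT/2 start-up cost against a flow's transmission time at line rate:

    def startup_overhead_fraction(flow_bytes, rtt_us, link_gbps=10):
        # Fixed start-up cost: RTT/2 elapses before the first data packet can leave.
        startup_us = rtt_us / 2
        # Time to transmit the flow's bytes at line rate, ignoring queuing.
        transmit_us = flow_bytes * 8 / (link_gbps * 1000)
        return startup_us / (startup_us + transmit_us)

    # Hypothetical base RTT of 4 microseconds in a small 2-tier fabric:
    for size in (1_500, 15_000, 1_000_000):
        print(size, round(startup_overhead_fraction(size, rtt_us=4.0), 2))
    # For a 1.5 KB flow the RTT/2 cost can dominate the FCT at low load, whereas for
    # a 1 MB flow it is negligible; under load, queuing and drops in the baselines
    # dominate instead, which is where LoCo's trade-off pays off (Section 6.4).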

6.4 Evaluation over Datacenter Workloads

In this section, we evaluate LoCo against standard datacenter workloads used in several prior works [5, 6, 8, 15–17, 29].

Workloads. We derive the flow distribution from three different workloads (detailed flow size distribution in [15]):

1. IMC workload from a production datacenter [8, 15].
2. Web search workload used in [5, 6, 17].
3. Data mining workload used in [6, 16, 17].

All three workloads are heavy-tailed, i.e., most of the flows are short but most of the bytes are in the long flows.

Traffic matrix. We use an all-to-all traffic matrix, where each end-host can send flows to and receive flows from all other end-hosts.

Evaluation metrics. (i) Network utilization: the ratio of the network throughput achieved to the optimal network throughput for a given workload; and (ii) Slowdown in flow completion time (FCT): slowdown(f) = FCT(f) / FCT(f) in an idle network, i.e., FCT normalized by the flow's FCT in an unloaded network. These metrics have been widely used in prior works [3, 6, 15, 17, 29].

Results. We begin our evaluation on a full bisection bandwidth topology. In terms of network utilization, LoCo achieves comparable performance to all the baseline protocols across all load values (Figure 11). Next, we focus on FCT slowdown at load 0.6 (the highest load value at which all protocols are stable). LoCo significantly outperforms DCTCP, NDP and DCQCN for the IMC (by 3× at the mean and 4–12× at the tail) and Data mining (by 2–4× at the mean and 4–16× at the tail) workloads, while achieving comparable performance for the Web search workload (Figure 12). Breaking down FCT by flow size (Figure 13), we find that LoCo achieves significantly lower FCT for small flows compared to DCTCP, NDP and DCQCN, which can be attributed to small network queuing and zero packet drops in LoCo, while achieving comparable FCT for long flows. Finally, Figure 14 shows that the above trends also hold for an oversubscribed topology.
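For illustration, the two metrics can be computed from per-flow records as in the following minimal Python sketch; the function names are ours, and the tail percentile is a parameter since the text only refers to "tail" slowdown.

    def network_utilization(achieved_gbps, optimal_gbps):
        # Ratio of achieved network throughput to the optimal throughput
        # for the same workload on the same topology.
        return achieved_gbps / optimal_gbps

    def slowdown(fct_seconds, fct_idle_seconds):
        # FCT normalized by the FCT the same flow would achieve in an idle network;
        # a slowdown of 1.0 means the flow completed as fast as possible.
        return fct_seconds / fct_idle_seconds

    def mean_and_tail(slowdowns, tail_pct=99):
        # Mean and tail (e.g., 99th-percentile) slowdown, as reported in Figures 12-14.
        s = sorted(slowdowns)
        tail = s[min(len(s) - 1, int(len(s) * tail_pct / 100))]
        return sum(s) / len(s), tail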

pFabric achieves better FCT than all the evaluated protocols, including LoCo, because unlike the other protocols, pFabric is a clairvoyant design: it assumes flow sizes are known in advance, and uses that information to perform in-network prioritization approximating SRPT [33] scheduling (known to achieve optimal FCTs). However, obtaining flow sizes in advance is not easy in practice [38]. In contrast, LoCo is a more general design that assumes no prior knowledge of flow sizes.


Figure 11: Network utilization for IMC (top), Web search (middle) and Data mining (bottom) workloads.

Figure 12: Slowdown in flow completion time for the evaluated workloads at load 0.6: (a) mean slowdown, (b) tail slowdown.

LoCo instead uses (approximate) fair queuing as its scheduling policy, randomly selecting src-dst pairs for matching. We note, however, that LoCo's design is not fundamentally tied to a particular scheduling policy: in principle, one could also perform matching in LoCo in the order of some specified priority, as in pFabric. This, however, has certain practical challenges, such as maintaining larger switch state for priorities, and implementing an efficient priority queue in hardware (also an inherent challenge with pFabric). We leave incorporating priorities into LoCo's design to future work. Finally, we also note that pFabric's design can lead to arbitrarily low network throughput for certain realistic workloads (Figure 10d).

Key takeaways. There are two key takeaways from the evaluation results: (i) LoCo provides worst-case queuing and utilization guarantees for arbitrary traffic patterns (§3.2), unlike any of the baseline protocols, while also achieving very high performance across a range of realistic traffic patterns (§6.2); in contrast, each of the baseline protocols can result in arbitrarily low network throughput for realistic traffic patterns (§6.2). (ii) With the exception of pFabric, LoCo also achieves significantly better performance over standard datacenter workloads compared to the baseline protocols (§6.4).

7 Related Work

We compared LoCo against several related works in prior sections (§2, §6). The novel aspect of LoCo's design that differentiates it from those works is that LoCo localizes the congestion to end-host egress queues, while providing worst-case bounds on both the queuing at each switch and the network utilization.

Figure 13: Breakdown of slowdown in flow completion time across flow sizes for the IMC workload at load 0.6: (a) mean slowdown, (b) tail slowdown.

Figure 14: Slowdown in flow completion time for the evaluated workloads at load 0.6 and an oversubscription ratio of 2:1: (a) mean slowdown, (b) tail slowdown.

Fastpass [30] could provide an even stronger guarantee of zero queuing and bounded utilization, but it is not scalable as it requires a central scheduler, whereas LoCo can scale to an entire datacenter. PFC [2] and ExpressPass [12] could guarantee bounded queuing, but provide no worst-case throughput guarantees; in fact, as we show in §2, PFC-based protocols can result in arbitrarily low network throughput for certain workloads. On the flip side, load balancing schemes such as Valiant-LB [32] could achieve a worst-case utilization bound similar to LoCo's, but provide no queuing guarantees. Finally, QJump [17] provides applications the flexibility to choose between bounded queuing and high throughput, but cannot simultaneously provide both to the same application.

8 Conclusion

We presented LoCo, a datacenter network design that takes a new approach to resolving the congestion control problem: it localizes the congestion to egress queues of the end-hosts, while providing worst-case bounds on both the queuing at each switch and the network utilization. LoCo achieves this by exploiting the structure in datacenter network topologies to perform network-wide admission control for each and every packet in the network. Our evaluation results demonstrate that LoCo not only achieves the above mentioned guarantees, but also achieves high performance over standard datacenter workloads. Finally, we note that LoCo's design also presents an interesting new direction in the space of designing networks for RDMA, and CPU-efficient network stacks.


References

[1] Buffer sizes at the switches. https://people.ucsc.edu/~warner/buffer.html, 2019.

[2] I. S. 802.3-2008. 10G Ethernet. http://standards.ieee.org/about/get/802/802.3.html, 2008.

[3] S. Agarwal, S. Rajakrishnan, A. Narayan, R. Agarwal, D. Shmoys, and A. Vahdat. Sincronia: Near-optimal network design for coflows. In SIGCOMM, 2018.

[4] M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM, 2008.

[5] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In SIGCOMM, 2010.

[6] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker. pFabric: Minimal Near-optimal Datacenter Transport. In SIGCOMM, 2013.

[7] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker. High-speed switch scheduling for local-area networks. ACM Transactions on Computer Systems (TOCS), 11(4):319–352, 1993.

[8] T. Benson, A. Akella, and D. A. Maltz. Network Traffic Characteristics of Data Centers in the Wild. In IMC, 2010.

[9] Bluespec. BSV High-Level HDL. https://bluespec.com/54621-2/, 2019.

[10] N. Buchbinder, D. Segev, and Y. Tkach. Online Algorithms for Maximum Cardinality Matching with Edge Arrivals. In ESA, 2017.

[11] P. Cheng, F. Ren, R. Shu, and C. Lin. Catch the whole lot in an action: Rapid precise packet loss notification in data centers. In NSDI, 2014.

[12] I. Cho, K. Jang, and D. Han. Credit-scheduled delay-bounded congestion control for datacenters. In SIGCOMM, 2017.

[13] I. DCB. 802.1Qbb - Priority-based Flow Control. http://www.ieee802.org/1/pages/802.1bb.html, 2011.

[14] N. Dukkipati. Rate Control Protocol (RCP): Congestion control to make flows complete quickly. http://yuba.stanford.edu/~nanditad/thesis-NanditaD.pdf, 2007.

[15] P. X. Gao, A. Narayan, G. Kumar, R. Agarwal, S. Ratnasamy, and S. Shenker. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric. In CoNEXT, 2015.

[16] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009.

[17] M. P. Grosvenor, M. Schwarzkopf, I. Gog, R. N. M. Watson, A. W. Moore, S. Hand, and J. Crowcroft. Queues Don't Matter When You Can JUMP Them! In NSDI, 2015.

[18] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. Moore, G. Antichi, and M. Wojcik. Re-architecting datacenter networks and stacks for low latency and high performance. In SIGCOMM, 2017.

[19] C.-Y. Hong, M. Caesar, and P. B. Godfrey. Finishing flows quickly with preemptive scheduling. In SIGCOMM, 2012.

[20] Intel. Stratix V FPGA. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-v/stx5_51001.pdf, 2015.

[21] V. Jacobson. Congestion avoidance and control. In SIGCOMM, 1988.

[22] R. M. Karp, U. V. Vazirani, and V. V. Vazirani. An optimal algorithm for on-line bipartite matching. In STOC, 1990.

[23] D. Katabi, M. Handley, and C. Rohrs. Congestion Control for High Bandwidth-Delay Product Networks. In SIGCOMM, 2002.

[24] K. S. Lee, H. Wang, V. Shrivastav, and H. Weatherspoon. Globally Synchronized Time via Datacenter Networks. In SIGCOMM, 2016.

[25] K. S. Lee, H. Wang, and H. Weatherspoon. SoNIC: Precise Realtime Software Access and Control of Wired Networks. In NSDI, 2013.

[26] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, and M. Yu. HPCC: High precision congestion control. In SIGCOMM, 2019.

[27] N. McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Transactions on Networking, 7(2):188–201, 1999.

[28] R. Mittal, T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats. TIMELY: RTT-based Congestion Control for the Datacenter. In SIGCOMM, 2015.

[29] B. Montazeri, Y. Li, M. Alizadeh, and J. Ousterhout. Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities. In SIGCOMM, 2018.

[30] J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal. Fastpass: A Centralized "Zero-queue" Datacenter Network. In SIGCOMM, 2014.

[31] A. Sivaraman, S. Subramanian, M. Alizadeh, S. Chole, S.-T. Chuang, A. Agrawal, H. Balakrishnan, T. Edsall, S. Katti, and N. McKeown. Programmable Packet Scheduling at Line Rate. In SIGCOMM, 2016.

[32] L. G. Valiant. A scheme for fast parallel communication. SIAM Journal on Computing, 11(2):350–361, 1982.

[33] Wikipedia. Shortest Remaining Time First. https://en.wikipedia.org/wiki/Shortest_remaining_time, 2019.

[34] Y. Xia, L. Subramanian, I. Stoica, and S. Kalyanaraman. One More Bit Is Enough. In SIGCOMM, 2005.

[35] D. Zats, A. P. Iyer, G. Ananthanarayanan, R. Agarwal, R. Katz, I. Stoica, and A. Vahdat. FastLane: Making short flows shorter with agile drop notification. In SoCC, 2015.

[36] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang. Congestion Control for Large-Scale RDMA Deployments. In SIGCOMM, 2015.

[37] DE5-Net FPGA development kit. http://de5-net.terasic.com.tw.

[38] V. Ðukic, S. A. Jyothi, B. Karlas, M. Owaida, C. Zhang, and A. Singla. Is advance knowledge of flow sizes a plausible assumption? In NSDI, 2019.


Appendix

A Proof for LoCo Results

Corollary 1. A tier-2 switch can issue at most n+1 PULL requests to any given source or for any given destination during any arbitrary n ∗ t(p, B/(k/2)) time-slice.

Proof. Note that any arbitrary period of n ∗ t(p, B/(k/2)) time can cover at most n+1 continuous blocks of t(p, B/(k/2)) time. Hence, the corollary follows directly from Invariant 1 and Invariant 2. ∎
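For completeness, one way to make the covering step explicit (our own restatement, writing T = t(p, B/(k/2)) and assuming, as the proof uses them, that Invariants 1 and 2 permit at most one PULL request per source and per destination in each length-T block):

\[
[s,\ s+nT) \;\subseteq\; \bigcup_{j=0}^{n} \Big[ \big\lfloor \tfrac{s}{T} \big\rfloor T + jT,\ \big\lfloor \tfrac{s}{T} \big\rfloor T + (j+1)T \Big),
\]

i.e., a window of length nT that starts inside some block overlaps that block and at most n subsequent ones, for a total of at most n+1 blocks and hence at most n+1 PULL requests.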

A.1 Proof for Theorem 1

Without loss of generality, we assume virtual output queued switches. Thus packets are queued at the virtual queues, one per ingress queue, at each egress port.

Part I: 2-tier multi-rooted topology with full bisection bandwidth. We first consider the upstream egress queues at tier-1 switches, i.e., queues storing data going from end-hosts to tier-2 switches. Each upstream egress link at a tier-1 switch is connected to one particular tier-2 switch. A tier-2 switch sends at most one PULL request per t(p, B) time on each of its ports connected to a tier-1 switch. Since an end-host sends exactly one data packet (of size p) per PULL request, the rate at which data is received at each upstream egress port of a tier-1 switch matches the bandwidth of the egress link. Hence, there cannot be persistent queuing at the upstream egress queues. However, there can be intermittent queuing if multiple end-hosts respond to PULL requests from the same tier-2 switch simultaneously. And since there are k/2 end-hosts per tier-1 switch, the upstream egress queues at a tier-1 switch are bounded by k/2 packets.

Next, we consider the egress queues at tier-2 switch ports. Each egress queue at a tier-2 switch is connected to a particular tier-1 switch, and hence has exactly k/2 destination end-hosts under it. Using Corollary 1, the egress queue at a tier-2 switch can have at most three outstanding packets per destination in a time-slice of 2 ∗ t(p, B/(k/2)), i.e., at most 3k/2 outstanding packets per 2 ∗ t(p, B/(k/2)) time-slice. Further, using Invariant 2, the first and the third packets to the same destination within the time-slice must be at least t(p, B/(k/2)) apart, during which the egress queue can drain k/2 packets, thus ensuring that at most k packets are queued at any given time.
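Writing T = t(p, B/(k/2)) and noting that the egress link drains one packet every t(p, B) = T/(k/2), i.e., k/2 packets per T, the arithmetic behind the k-packet bound can be restated as follows (our own restatement of the argument above):

\[
\underbrace{3 \cdot \tfrac{k}{2}}_{\text{max arrivals in a } 2T \text{ window}} \;-\; \underbrace{\tfrac{k}{2}}_{\text{min packets drained before the third packets arrive}} \;=\; k .
\]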

Finally, we consider the downstream egress queues at tier-1 switches, i.e., queues storing data going from tier-2 switches to end-hosts. Each downstream egress queue corresponds to a particular destination and has k/2 virtual queues, one per ingress queue. Further, each downstream ingress queue receives data directly from a particular tier-2 switch egress queue. As discussed in the previous paragraph, each tier-2 switch egress queue receives at most three packets per destination during a time-slice of 2 ∗ t(p, B/(k/2)). Thus, a virtual queue at a downstream egress queue of a tier-1 switch can receive at most three packets per 2 ∗ t(p, B/(k/2)) time-slice, with the first and the third packets at least t(p, B/(k/2)) apart (Invariant 2). Since there are exactly k/2 virtual queues, and we schedule from these queues in a fair-queuing manner, the first packet in a virtual queue within a time-slice of 2 ∗ t(p, B/(k/2)) is guaranteed to have been transmitted before the third packet arrives, thus bounding the size of each virtual queue to two packets, and the size of the egress queue to k packets.

Part II: 3-tier multi-rooted topology with full bisection bandwidth. The data queue sizes at the tier-3 and tier-2 switches in a 3-tier multi-rooted topology have the exact same bounds (k packets) as the tier-2 and tier-1 switches in a 2-tier multi-rooted topology, respectively. This is a direct consequence of the decomposition of the 3-tier topology into multiple parallel 2-tier topologies, as explained in §3.3.2.

Further, the upstream egress queues at a tier-1 switch are bounded by k/2 packets; this can be proven along the same lines as the queue bound proof for the upstream egress queues at tier-1 switches in a 2-tier topology, as described above.

Finally, the mapping of each destination end-host to a specific parallel 2-tier network (Figure 6) results in each downstream egress queue at a tier-1 switch being mapped to a unique downstream egress queue at a tier-2 switch, and hence sharing the same queue bound of k packets.

Part III: Oversubscribed 2-tier and 3-tier multi-rooted topologies. In §3.4, we describe how we extend LoCo's design from full bisection bandwidth topologies to oversubscribed topologies while maintaining the same queue bounds. Thus, the queues at each switch port in an oversubscribed topology are also bounded by k packets. ∎

A.2 Proof for Theorem 2

For a single-switch network, LoCo simply runs the online bipartite matching algorithm [27] at the switch for admission control. §3.1.1 already explains how this yields a theoretical guarantee that the network utilization for any workload is at least 50% of the theoretically achievable utilization for that workload.
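As a point of reference for the 50% guarantee, the following minimal Python sketch performs a greedy maximal matching over backlogged src-dst pairs; a greedy maximal matching is a standard way to obtain at least half of the maximum matching, though the exact algorithm LoCo runs at the switch is the one described in §3.1.1 and [27], not necessarily this sketch.

    import random

    def greedy_matching(demands):
        # demands: iterable of (src, dst) pairs with backlogged traffic.
        # Greedily admit pairs in random order, never reusing a source or a
        # destination within the same round; the result is a maximal matching,
        # which is at least half the size of a maximum matching.
        demands = list(demands)
        random.shuffle(demands)
        used_src, used_dst, matched = set(), set(), []
        for src, dst in demands:
            if src not in used_src and dst not in used_dst:
                used_src.add(src)
                used_dst.add(dst)
                matched.append((src, dst))
        return matched

    # Example: an incast (everyone sends to host 0) admits exactly one sender per round.
    print(greedy_matching([(s, 0) for s in range(1, 5)]))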

Next, for full bisection bandwidth 2-tier and 3-tier topologies, we recursively decompose the network into multiple logical smaller networks (§3.3)—from 3-tier to multiple 2-tier networks to multiple single-switch networks, each independently running the same online bipartite matching algorithm. Hence, the utilization bound for the single-switch network also applies to 2-tier and 3-tier networks.

However, for oversubscribed network topologies, we cannot simply decompose multi-tier networks into an equivalent collection of logical single-switch networks (as was done for full bisection bandwidth topologies). And although our decomposition technique for oversubscribed topologies (§3.4) utilizes the bandwidth effectively and works well in practice (Figure 14), we do not have a theoretical worst-case utilization bound; in fact, this remains an open problem [3]. ∎
