
Hedera: Dynamic Flow Scheduling for Data Center Networks

Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan*, Nelson Huang, Amin Vahdat

UC San Diego * Williams College

- USENIX NSDI 2010 -

Motivation

• Current data center networks support tens of thousands of machines

• Limited port densities at the core routers → horizontal expansion, i.e. increasingly relying on multipathing

!"#$%&'($)*#$%&'($)*

+++ +++ +++

+++

+++

+++

+++

+++ +++ +++

+++

+++

!""#$"%&'()

*+"$

,(#$

Motivation

• MapReduce/Hadoop-style workloads have substantial BW requirements

• Shuffle phase stresses network interconnect

• Oversubscription / bad forwarding → jobs often bottlenecked by the network

[Figure: MapReduce workflow; Input Dataset → Map tasks → Reduce tasks → Output Dataset]

Contributions

• Integration + working implementation of:

1. Centralized Data Center routing control

2. Flow Demand Estimation

3. Efficient + Fast Scheduling Heuristics

• Enables more efficient utilization of network infrastructure

• Up to 96% of optimal bisection bandwidth, > 2X better than standard techniques

Background

• Current industry standard: Equal-Cost Multi-Path (ECMP)

• Given a packet to a subnet with multiple paths, forward the packet based on a hash of the packet’s headers (sketched below)

• Originally developed as a wide-area / backbone TE tool

• Implemented in: Cisco / Juniper / HP ... etc.

[Figure: ECMP hash collisions on a fat-tree; Flows A, B, C, D across Core 0-3 and Agg 0-2, showing a local collision and a downstream collision]
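For intuition, here is a toy sketch of ECMP-style next-hop selection in Python (illustrative only; the hash function and header fields are stand-ins, not any vendor's implementation):

```python
# Toy sketch of ECMP-style next-hop selection (illustrative; not any
# vendor's implementation). The switch hashes the flow's 5-tuple and
# indexes into its list of equal-cost next hops, so all packets of a
# flow follow the same path. The choice ignores link utilization, so
# two large flows can collide on one uplink while others sit idle.
import zlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

uplinks = ["core0", "core1", "core2", "core3"]
print(ecmp_next_hop("10.0.1.2", "10.4.1.2", 51000, 5001, "tcp", uplinks))
```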

Background

• ECMP drawback: static and oblivious to link utilization!

• Causes long-term local/downstream flow collisions

• On a 27K-host fat-tree with a randomized traffic matrix, ECMP wastes an average of 61% of the bisection bandwidth!


Problem Statement

Problem:

Given a dynamic traffic matrix of flow demands, how do you find paths that maximize network bisection bandwidth?

Constraint:

Commodity Ethernet switches + No end-host mods

Problem Statement

• Single path forwarding (no flow splitting)

• Expressed as Binary Integer Programming (BIP)

• Combinatorial, NP-complete

• Exact solvers (CPLEX/GLPK) are impractical for realistic networks (see the toy search below)

MULTI-COMMODITY FLOW problem:

1. Capacity constraint: $\sum_{i=1}^{k} f_i(u,v) \le c(u,v)$

2. Flow conservation: $\sum_{w \in V} f_i(u,w) = 0$ for $u \ne s_i, t_i$, and $\forall u,v:\ f_i(u,v) = -f_i(v,u)$

3. Demand satisfaction: $\sum_{w \in V} f_i(s_i,w) = d_i$ and $\sum_{w \in V} f_i(w,t_i) = d_i$
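To see why the single-path version is combinatorial, here is a toy exhaustive search (illustration only; the dictionaries of demands, candidate paths, and capacities are assumed inputs, not the paper's data structures). With k flows and p candidate paths each it enumerates p**k assignments, which is exactly why exact solvers do not scale:

```python
# Toy exhaustive search over single-path assignments (illustration only).
# With k flows and p candidate paths each there are p**k combinations.
from itertools import product

def first_feasible_assignment(demands, candidate_paths, capacity):
    """demands: {flow: Gbps}; candidate_paths: {flow: [paths]}, each path a
    tuple of link ids; capacity: {link: Gbps}. Returns a flow -> path map."""
    flows = list(demands)
    for choice in product(*(candidate_paths[f] for f in flows)):
        used = {}
        for f, path in zip(flows, choice):
            for link in path:
                used[link] = used.get(link, 0.0) + demands[f]
        if all(used[l] <= capacity[l] for l in used):
            return dict(zip(flows, choice))
    return None  # no single-path assignment satisfies every demand
```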

Problem Statement

• Polynomial-time algorithms known for 3-stage Clos Networks (based on bipartite edge-coloring)

• None for 5-stage Clos (3-tier fat-trees)

• Need to target arbitrary/general DC topologies!


Architecture

• Hedera : Dynamic Flow Scheduling

• Optimize achievable bisection bandwidth by assigning flows non-conflicting paths

• Uses flow demand estimation + placement heuristics to find good flow-to-core mappings

[Figure: control loop; 1. Detect large flows → 2. Estimate flow demands → 3. Schedule flows]

Architecture

• Scheduler operates a tight control-loop:

1. Detect large flows

2. Estimate their bandwidth demands

3. Compute good paths and insert flow entries into the switches (loop sketched below)

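A minimal sketch of that control loop (the switch-facing helpers below are stubs standing in for OpenFlow statistics polling and flow installation; the polling period and names are assumptions, not Hedera's API):

```python
# Sketch of the scheduler's control loop; the three stubs stand in for
# the statistics polling and flow installation described later.
import time

LARGE_FLOW_BPS = 100e6   # 10% of a 1 Gb/s host link (see Elephant Detection)

def poll_edge_switches():
    """Stub: return [(flow_id, rate_bps), ...] from edge-switch byte counts."""
    return []

def estimate_and_place(large_flows):
    """Stub: demand estimation + placement heuristic (later slides)."""
    return {}

def install_paths(placement):
    """Stub: push per-flow forwarding entries into the switches."""

def control_loop(period_s=1.0):
    while True:
        stats = poll_edge_switches()                         # 1. detect
        elephants = [f for f, bps in stats if bps > LARGE_FLOW_BPS]
        placement = estimate_and_place(elephants)            # 2. estimate, 3. place
        install_paths(placement)
        time.sleep(period_s)
```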

Elephant Detection

Elephant Detection

• Scheduler continually polls edge switches for flow byte-counts

• Flows exceeding B/s threshold are “large”

• > 10% of a host’s link capacity in our implementation (i.e. > 100 Mb/s; sketched below)

• What if there are only “small” flows?

• Default ECMP load-balancing is efficient
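A small sketch of that classification from successive byte-count polls (illustrative; assumes 1 GbE host links and a dictionary of cumulative per-flow byte counters):

```python
# Classify "elephants" from two successive byte-count polls: a flow is
# large if its rate exceeds 10% of the host link (100 Mb/s on 1 GbE).
LINK_CAPACITY_BPS = 1e9
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS

def large_flows(prev_bytes, curr_bytes, interval_s):
    """prev_bytes / curr_bytes: {flow_id: cumulative bytes seen at the edge}."""
    elephants = []
    for flow, now in curr_bytes.items():
        rate_bps = 8 * (now - prev_bytes.get(flow, 0)) / interval_s
        if rate_bps > THRESHOLD_BPS:
            elephants.append(flow)
    return elephants

# Example: 20 MB in a 1 s interval is ~160 Mb/s, so "f1" is scheduled by
# Hedera while "f2" stays on default ECMP.
print(large_flows({"f1": 0, "f2": 0}, {"f1": 20_000_000, "f2": 100_000}, 1.0))
```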

Elephant Detection

• Hedera complements ECMP!

• Default forwarding uses ECMP

• Hedera schedules large flows that cause bisection bandwidth problems

Demand Estimation

Demand Estimation

Motivation:

• Empirical measurements of flow rates are not suitable or sufficient for flow scheduling

• Current TCP flow rates may be constrained by inefficient forwarding

• Need to find the flows’ overall fair bandwidth allocation, to better inform the placement algorithms

Demand Estimation

• TCP’s AIMD + Fair Queueing try to achieve max-min fairness in steady state

• When routing is a degree of freedom, establishing max-min fair demands is hard

• Ideal case: find the max-min fair bandwidth allocation as if flows were constrained only by the host NICs

Demand Estimation

• Given the traffic matrix of large flows, iteratively modify each flow’s demand estimate at its source and destination (a sketch follows this slide):

1. Sender equally distributes unconverged bandwidth among outgoing flows

2. Receivers whose NIC capacity is exceeded decrease the rates of incoming flows equally

3. Repeat until all flows converge

• Guaranteed to converge in O(|F|) time
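A sketch of the estimator (illustrative Python, not the paper's pseudocode; NIC capacities are normalized to 1.0 and the names are mine). It reproduces the A/B/C-to-X/Y example worked through on the next slides:

```python
# Iterative max-min demand estimation: senders split spare capacity equally
# among unconverged flows; oversubscribed receivers cap their incoming flows
# at a fair share (water-filling) and mark them converged.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Flow:
    src: str
    dst: str
    demand: float = 0.0
    converged: bool = False

def estimate_demands(flows, cap=1.0, max_iters=100):
    for _ in range(max_iters):
        changed = False

        # Sender step: split unconverged capacity equally.
        by_src = defaultdict(list)
        for f in flows:
            by_src[f.src].append(f)
        for fs in by_src.values():
            unconv = [f for f in fs if not f.converged]
            if not unconv:
                continue
            share = (cap - sum(f.demand for f in fs if f.converged)) / len(unconv)
            for f in unconv:
                if f.demand != share:
                    f.demand, changed = share, True

        # Receiver step: water-filling on oversubscribed NICs.
        by_dst = defaultdict(list)
        for f in flows:
            by_dst[f.dst].append(f)
        for fs in by_dst.values():
            if sum(f.demand for f in fs) <= cap:
                continue
            limited, remaining = list(fs), cap
            while True:
                share = remaining / len(limited)
                small = [f for f in limited if f.demand < share]
                if not small:
                    break
                remaining -= sum(f.demand for f in small)
                limited = [f for f in limited if f.demand >= share]
            for f in limited:
                if not f.converged or f.demand != share:
                    f.demand, f.converged, changed = share, True, True

        if not changed:
            break
    return flows

# The example from the next slides: A->X converges to 2/3; A->Y, B->Y, C->Y
# each converge to 1/3.
demo = [Flow("A", "X"), Flow("A", "Y"), Flow("B", "Y"), Flow("C", "Y")]
for f in estimate_demands(demo):
    print(f"{f.src}->{f.dst}: {f.demand:.3f}")
```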

Demand Estimation

[Figure: senders A, B, C and receivers X, Y; flows A→X, A→Y, B→Y, C→Y]

Flow    Estimate   Conv.?
A→X     -          -
A→Y     -          -
B→Y     -          -
C→Y     -          -

Senders:

Sender   Available BW   Unconv. Flows   Share
A        1              2               1/2
B        1              1               1
C        1              1               1

Demand Estimation

Receivers:

Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      Yes   3              1/3

Flow    Estimate   Conv.?
A→X     1/2        -
A→Y     1/2        -
B→Y     1          -
C→Y     1          -

Demand Estimation

Flow    Estimate   Conv.?
A→X     1/2        -
A→Y     1/3        Yes
B→Y     1/3        Yes
C→Y     1/3        Yes

Senders:

Sender   Available BW   Unconv. Flows   Share
A        2/3            1               2/3
B        0              0               0
C        0              0               0

Demand Estimation

Flow    Estimate   Conv.?
A→X     2/3        Yes
A→Y     1/3        Yes
B→Y     1/3        Yes
C→Y     1/3        Yes

Receivers:

Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      No    -              -

Placement Heuristics

Global First-Fit

[Figure: scheduler and cores 0-3; candidate paths marked "?" for Flows A, B, C]

• New flow detected: linearly search all possible paths from source to destination

• Place the flow on the first path whose component links can fit that flow

Global First-Fit

[Figure: Flows A, B, C placed across cores 0-3 by the scheduler]

• Flows are placed upon detection and are not moved afterwards

• Once a flow ends, its entries and reservations time out (a first-fit sketch follows)
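A minimal sketch of Global First-Fit (the path and reservation representation is an assumption for illustration, not Hedera's data structures):

```python
# Global First-Fit: scan a new large flow's candidate paths in a fixed
# order and reserve the first one whose links all have enough headroom.
def global_first_fit(demand, candidate_paths, reserved, capacity):
    """candidate_paths: list of paths, each a tuple of link ids;
    reserved: {link: Gbps already reserved}; capacity: {link: Gbps}."""
    for path in candidate_paths:
        if all(reserved.get(l, 0.0) + demand <= capacity[l] for l in path):
            for l in path:
                reserved[l] = reserved.get(l, 0.0) + demand
            return path          # placed once, never moved afterwards
    return None                  # no fit: the flow stays on default ECMP

# When a flow ends, its reservations (and switch entries) simply time out,
# i.e. the reserved amounts on its links are released again.
```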

Simulated Annealing

• Probabilistic search for good flow-to-core mappings

• Goal: Maximize achievable bisection bandwidth

• The current flow-to-core mapping generates a neighbor state

• Calculate the total exceeded bandwidth capacity

• Accept the move to the neighbor state if it improves bisection BW

• A few thousand iterations for each scheduling round

• To avoid local minima, also move to a worse state with non-zero probability (see the sketch below)
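A compact sketch of that search (illustrative; `links_for` is an assumed callback mapping a flow/core choice to the links it would use, and the cooling schedule is a simple stand-in):

```python
# Simulated annealing over flow-to-core mappings. Energy = total bandwidth
# by which links exceed capacity; a neighbor re-assigns one flow to another
# core; worse moves are accepted with probability exp(-delta/T) so the
# search can escape local minima.
import math
import random

def exceeded_capacity(mapping, demands, links_for, capacity):
    load = {}
    for flow, core in mapping.items():
        for link in links_for(flow, core):
            load[link] = load.get(link, 0.0) + demands[flow]
    return sum(max(0.0, used - capacity[l]) for l, used in load.items())

def anneal(demands, cores, links_for, capacity, iters=10_000, t0=1.0):
    flows = list(demands)
    state = {f: random.choice(cores) for f in flows}   # or last round's state
    energy = exceeded_capacity(state, demands, links_for, capacity)
    for i in range(iters):
        temp = t0 * (1.0 - i / iters) + 1e-9           # simple cooling schedule
        neighbor = dict(state)
        neighbor[random.choice(flows)] = random.choice(cores)
        e_new = exceeded_capacity(neighbor, demands, links_for, capacity)
        if e_new < energy or random.random() < math.exp((energy - e_new) / temp):
            state, energy = neighbor, e_new
    return state
```

Hedera's actual implementation additionally shrinks the search space (e.g. one core switch per destination host) and computes the exceeded capacity incrementally, per the optimizations on the next slide.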

Simulated Annealing

• Implemented several optimizations that reduce the search-space significantly:

• Assign a single core switch to each destination host

• Incremental calculation of exceeded capacity

• .. among others

Simulated Annealing

• Example run: 3 flows, 3 iterations

[Figure: scheduler exploring candidate core assignments (cores 0-3) for Flows A, B, C over three iterations]

Simulated Annealing

• Final state is published to the switches and used as the initial state for the next round

[Figure: final flow-to-core assignment installed by the scheduler]

Fault-Tolerance

Fault-Tolerance

[Figure: Flows A, B, C re-routed across cores 0-3 after a failure]

• Link / switch failure: use PortLand’s fault notification protocol

• Hedera routes around failed components

Fault-Tolerance

[Figure: Flows A, B, C continue via default ECMP if the scheduler fails]

• Scheduler failure:

• Soft state, not required for correctness (connectivity)

• Switches fall back to ECMP

Implementation

Implementation

• 16-host testbed

• k=4 fat-tree data-plane

• 20 machines; 4-port NetFPGA OpenFlow switches

• Parallel 48-port non-blocking Quanta switch

• 1 Scheduler machine

• Dynamic traffic monitoring

• OpenFlow routing control

Evaluation - Testbed

[Figure: bisection bandwidth (Gbps, 0-15) by communication pattern (Stag1(.2,.3), Stag2(.2,.3), Stag1(.5,.3), Stag2(.5,.3), Stride(2), Stride(4), Stride(8)) for ECMP, Global First-Fit, Simulated Annealing, and a non-blocking switch]

Evaluation - Testbed

[Figure: bisection bandwidth (Gbps, 0-15) by communication pattern (Rand0, Rand1, RandBij0, RandBij1, RandX2, RandX3, Hotspot) for ECMP, Global First-Fit, Simulated Annealing, and a non-blocking switch]

Data Shuffle

                             ECMP     GFF      SA       Control
Total shuffle time (s)       438.4    335.5    336.0    306.4
Avg. completion time (s)     358.1    258.7    262.0    226.6
Avg. bisection BW (Gbps)     2.81     3.89     3.84     4.44
Avg. host goodput (MB/s)     20.9     29.0     28.6     33.1

• 16-hosts: 120 GB all-to-all in-memory shuffle

• Hedera achieves 39% better bisection BW than ECMP and 88% of the ideal non-blocking switch
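As a quick check against the table: 3.89 / 2.81 ≈ 1.39, i.e. roughly 39% more bisection bandwidth than ECMP (using Global First-Fit's average), and 3.89 / 4.44 ≈ 0.88, i.e. about 88% of the ideal non-blocking control.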

Evaluation - Simulator

• For larger topologies:

• Models TCP’s AIMD behavior when constrained by the topology

• Stochastic flow arrival times / Bytes

• Calibrated its performance against testbed

• What about ns2 / OMNeT++ ?

• Packet-level simulators impractical at these network scales

Simulator - 8,192 hosts (k=32)

[Figure: bisection bandwidth (Gbps, 0-7000) by communication pattern (Stag(.5,0), Stag(.2,.3), Stride(1), Stride(16), Stride(256), RandBij, Rand) for ECMP, Global First-Fit, Simulated Annealing, and a non-blocking topology]

Reactiveness

• Demand Estimation:

• 27K hosts, 250K flows, converges < 200ms

• Simulated Annealing:

• Asymptotically dependent on # of flows + # iter:

• 50K flows and 10K iter: 11ms

• Most of the final bisection BW is reached within the first few hundred iterations

• Scheduler control loop:

• Polling + Estimation + SA = 145ms for 27K hosts

Limitations

• Dynamic workloads where large-flow turnover is faster than the control loop

• The scheduler will be continually chasing the traffic matrix

• Need to include a penalty term for unnecessary SA flow re-assignments

[Figure: flow size vs. traffic-matrix stability (unstable to stable), with regions labeled ECMP and Hedera]

Future Work

• Improve utility function of Simulated Annealing

• SA movement penalties (TCP)

• Add flow priorities (QoS)

• Incorporate other metrics: e.g. Power

• Release combined system: PortLand + Hedera (6/1)

• Perfect, non-centralized, per-packet Valiant Load Balancing

Conclusions

• Simulated Annealing delivers significant bisection BW gains over standard ECMP

• Hedera complements ECMP

• RPC-like traffic is fine with ECMP

• If you’re running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera; tiny investment!

Questions?

http://cseweb.ucsd.edu/~malfares/

Traffic Overhead

• 27K host network:

• Polling: 72 B/flow × 5 flows/host × 27K hosts / 0.1 s comes to just under 100 MB/s for the whole data center (worked out below)

• Could also use data-plane
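Worked out: 72 B/flow × 5 flows/host × 27,000 hosts = 9.72 MB per poll; at one poll every 0.1 s that is roughly 97 MB/s of monitoring traffic aggregated across the entire data center, just under the 100 MB/s quoted above.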