Coflow: Recent Advances and What's Next?
Mosharaf Chowdhury
Big Data
The volume of data businesses want to make sense of is increasing
Increasing variety of sources: Web, mobile, wearables, vehicles, scientific, …
Cheaper disks, SSDs, and memory
Stalling processor speeds
Big Datacenters for Massive Parallelism
[Timeline, 2005–2015: MapReduce, Hadoop, Dryad, Hive, DryadLINQ, Pregel, Storm, Dremel, GraphLab, Spark, GraphX, Spark-Streaming, BlinkDB, TensorFlow]
Data-Parallel Applications
Multi-stage dataflow: computation interleaved with communication
Computation stage (e.g., Map, Reduce): distributed across many machines; tasks run in parallel
Communication stage (e.g., Shuffle): between successive computation stages, such as between the Map stage and the Reduce stage
A communication stage cannot complete until all the data have been transferred
Communication is Crucial
Performance
As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck
Facebook jobs spend ~25% of runtime on average in intermediate communication1
1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Faster Communication Stages: The Networking Approach
"Configuration should be handled at the system level"
Flow: transfers data from a source to a destination; the independent unit of allocation, sharing, load balancing, and/or prioritization
Existing Solutions
[Timeline, 1980s to 2015: solutions for per-flow fairness (e.g., GPS, WFQ, RED, ECN, CSFQ, XCP, RCP, DCTCP) followed by solutions for flow completion time (e.g., D3, DeTail, D2TCP, PDQ, FCP, pFabric)]
Independent flows cannot capture the collective communication behavior common in data-parallel applications
Why Do They Fall Short?
[Figure: a shuffle from senders s1, s2, s3 to receivers r1, r2 across a datacenter network with three input links and three output links]
[Figure: per-flow fair sharing timeline over the links to r1 and r2; flow completion times are 3, 3, and 5 on each link, so Shuffle Completion Time = 5 and Avg. Flow Completion Time = 3.66]
Solutions focusing on flow completion time cannot further decrease the shuffle completion time
Improve Application-Level Performance1
[Figure: the same shuffle under per-flow fair sharing: Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
Slow down faster flows to accelerate slower flows
[Figure: Data-Proportional Allocation timeline; all six flows finish at time 4, so Shuffle Completion Time = 4 and Avg. Flow Completion Time = 4]
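To make the arithmetic concrete, here is a toy fluid simulation of this example. It is a sketch under inferred sizes: the slide's numbers are consistent with s1 and s2 each sending 1 unit to both receivers and s3 sending 2 units to each, over unit-capacity links.

```python
# Toy fluid model of the shuffle above. Unit-capacity links; sizes inferred
# from the slide: s1 and s2 send 1 unit to each receiver, s3 sends 2 units.
flows = {("s1", "r1"): 1.0, ("s1", "r2"): 1.0,
         ("s2", "r1"): 1.0, ("s2", "r2"): 1.0,
         ("s3", "r1"): 2.0, ("s3", "r2"): 2.0}

def fair_rates(remaining):
    """Per-flow fair sharing: each flow gets the smaller of its equal shares
    on its sender and receiver links (a simplification of max-min fairness
    that happens to be exact for this example)."""
    rates = {}
    for (src, dst) in remaining:
        n_src = sum(1 for f in remaining if f[0] == src)
        n_dst = sum(1 for f in remaining if f[1] == dst)
        rates[(src, dst)] = min(1.0 / n_src, 1.0 / n_dst)
    return rates

def simulate(flows):
    remaining, now, fct = dict(flows), 0.0, {}
    while remaining:
        rates = fair_rates(remaining)
        # Advance time until the next flow completes.
        step = min(size / rates[f] for f, size in remaining.items())
        now += step
        for f in list(remaining):
            remaining[f] -= rates[f] * step
            if remaining[f] <= 1e-9:
                del remaining[f]
                fct[f] = now
    return fct

fct = simulate(flows)
print(sorted(fct.values()))                    # [3.0, 3.0, 3.0, 3.0, 5.0, 5.0]
print(round(sum(fct.values()) / len(fct), 2))  # 3.67 -> shuffle ends at 5

# Data-proportional allocation instead matches every flow's rate to its share
# of the bottleneck (4 units each on s3, r1, and r2): rates 1/4, 1/4, 1/2 on
# each receiver link, so all six flows finish together at t = 4.
```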
Coflow: a communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.
[Figure: common coflow structures: single flow, parallel flows, shuffle, broadcast, aggregation, and all-to-all]
How to schedule coflows online…
#1 …for faster completion of coflows?
#2 …to meet more deadlines?
#3 …for fair allocation of the network?
[Figure: the datacenter fabric abstracted as one big switch with N input and N output ports]
Varys1: enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
Coflow: a communication abstraction for data-parallel applications to express their performance goals. A clairvoyant scheduler knows, for each coflow:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
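A minimal sketch (illustrative types, not the Varys implementation) of this per-coflow bookkeeping; it also derives the length, width, size, and bottleneck metrics that the ordering heuristics later on compare.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Flow:
    src: str        # sender endpoint
    dst: str        # receiver endpoint
    size: float     # bytes to transfer, known up front

@dataclass
class Coflow:
    cid: str
    flows: List[Flow] = field(default_factory=list)

    def width(self) -> int:
        """Total number of flows."""
        return len(self.flows)

    def size(self) -> float:
        """Sum of all flow sizes."""
        return sum(f.size for f in self.flows)

    def length(self) -> float:
        """Size of the largest single flow."""
        return max(f.size for f in self.flows)

    def bottleneck(self) -> float:
        """Load on the coflow's most congested endpoint."""
        load: Dict[str, float] = {}
        for f in self.flows:
            load[f.src] = load.get(f.src, 0.0) + f.size
            load[f.dst] = load.get(f.dst, 0.0) + f.size
        return max(load.values())
```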
Benefits of Inter-Coflow Scheduling
[Figure: two links; Coflow 1 has a 3-unit flow on Link 1, while Coflow 2 has a 2-unit flow on Link 1 and a 6-unit flow on Link 2]
• Fair Sharing: Coflow1 comp. time = 5, Coflow2 comp. time = 6
• Smallest-Flow First1,2 (flow-level prioritization): Coflow1 comp. time = 5, Coflow2 comp. time = 6
• The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
Inter-Coflow Scheduling is NP-Hard
Equivalent to Concurrent Open Shop Scheduling1:
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
1. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.

Over a datacenter fabric, it becomes Concurrent Open Shop Scheduling with Coupled Resources, and it remains NP-Hard:
• Each flow occupies both an input link and an output link, so solutions must additionally consider matching constraints
[Figure: Coflow 1 (3 units) and Coflow 2 (6 and 2 units) mapped onto the input and output links of a 3×3 datacenter fabric]
Varys employs a two-step algorithm to minimize coflow completion times:
1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish in minimum time
Ordering Heuristic
[Figure: three coflows over a 3×3 fabric with per-port demands of 3, 3, 3 (C1), 5, 5 (C2), and 6 (C3); four orderings compared on the same timeline]
• Shortest-First (length: C1 = 3, C2 = 5, C3 = 6): C1 ends at 3, C2 at 13, C3 at 19 (Total CCT = 35)
• Narrowest-First (width: C1 = 3, C2 = 2, C3 = 1): C3 ends at 6, C2 at 16, C1 at 19 (Total CCT = 41)
• Smallest-First (size: C1 = 9, C2 = 10, C3 = 6): C3 ends at 6, C1 at 9, C2 at 19 (Total CCT = 34)
• Smallest-Bottleneck (bottleneck: C1 = 3, C2 = 10, C3 = 6): C1 ends at 3, C3 at 9, C2 at 19 (Total CCT = 31)
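The ordering step itself reduces to a sort. A sketch using the example's metric values (the total CCTs in the comments are taken from the timelines above):

```python
# Coflow metrics from the example above; Varys sorts by the bottleneck.
coflows = {
    "C1": {"length": 3, "width": 3, "size": 9,  "bottleneck": 3},
    "C2": {"length": 5, "width": 2, "size": 10, "bottleneck": 10},
    "C3": {"length": 6, "width": 1, "size": 6,  "bottleneck": 6},
}

def order(metric):
    """Schedule coflows in ascending order of the chosen metric."""
    return sorted(coflows, key=lambda c: coflows[c][metric])

print(order("length"))      # ['C1', 'C2', 'C3'] -> Total CCT = 35
print(order("width"))       # ['C3', 'C2', 'C1'] -> Total CCT = 41
print(order("size"))        # ['C3', 'C1', 'C2'] -> Total CCT = 34
print(order("bottleneck"))  # ['C1', 'C3', 'C2'] -> Total CCT = 31
```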
Allocation Algorithm
A coflow cannot finish before its very last flow
Finishing flows faster than the bottleneck cannot decrease a coflow’s completion time
Allocate minimum flow rates such that all flows of a coflow finish together on time
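A sketch of this allocation step, which the Varys paper calls MADD (Minimum Allocation for Desired Duration), assuming unit-capacity ports:

```python
def madd(flows, capacity=1.0):
    """flows: (src, dst) -> remaining bytes. Returns the coflow's minimum
    completion time and the slowest per-flow rates that still achieve it."""
    load = {}
    for (src, dst), size in flows.items():
        load[src] = load.get(src, 0.0) + size
        load[dst] = load.get(dst, 0.0) + size
    duration = max(load.values()) / capacity   # bottleneck port drain time
    # Running any flow faster than size/duration finishes it early without
    # finishing the coflow earlier; the freed bandwidth goes to coflows
    # later in the order.
    return duration, {f: size / duration for f, size in flows.items()}

duration, rates = madd({("s1", "r1"): 1, ("s1", "r2"): 1,
                        ("s2", "r1"): 1, ("s2", "r2"): 1,
                        ("s3", "r1"): 2, ("s3", "r2"): 2})
print(duration, rates)   # 4.0: every flow finishes together at t = 4
```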
Varys: enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Need for Coordination
[Figure: coflows C1 and C2 with bottlenecks of 4 and 5 units spread across the fabric ports]
Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)
Scheduling without coordination: C1 ends at 7, C2 ends at 12 (Total CCT = 19)
Uncoordinated local decisions interleave coflows, hurting performance
Varys Architecture
Centralized master-slave architecture:
• Applications use a client library to communicate with the master
• Actual timing and rates are determined by the coflow scheduler
[Figure: computation tasks (sender, receiver, driver) call the Varys client library, which issues put/get/reg requests to the Varys Master; the master contains the Coflow Scheduler with a Topology Monitor and a Usage Estimator, talks to the (distributed) file system, and enforces decisions through Varys Daemons over the network interface]
1. Download from http://varys.net
Varys: enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Coflow API
• register
• put
• get
• unregister

Usage in a MapReduce-like job with a broadcast coflow b and a shuffle coflow s:
@driver: b = register(BROADCAST); id = b.put(content); s = register(SHUFFLE)
@mapper: b.get(id); …; ids1 = s.put(content)
@reducer: s.get(ids1); …
@driver: s.unregister(); b.unregister()

1. NO changes to user jobs
2. NO storage management
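A runnable sketch of this four-call pattern. The in-memory stub below is purely illustrative: the real Varys client library registers coflows with the Varys master and moves data over the network rather than through a local dict.

```python
import uuid

class CoflowStub:
    """Stand-in for a Varys client handle; stores blobs in memory."""
    def __init__(self, kind):
        self.kind = kind
        self._store = {}

    def put(self, content):
        """Register `content`; the returned id is shared via the driver."""
        data_id = str(uuid.uuid4())
        self._store[data_id] = content
        return data_id

    def get(self, data_id):
        """Fetch previously put content (over the network in real Varys)."""
        return self._store[data_id]

    def unregister(self):
        self._store.clear()

def register(kind):
    return CoflowStub(kind)

# @driver: one coflow per communication stage.
b, s = register("BROADCAST"), register("SHUFFLE")
conf_id = b.put({"reducers": 2})

# @mapper: read the broadcast, write shuffle partitions.
conf = b.get(conf_id)
ids = [s.put(f"partition-{i}") for i in range(conf["reducers"])]

# @reducer: fetch this reducer's partition.
print(s.get(ids[0]))

# @driver: tear down once the stages finish.
s.unregister(); b.unregister()
```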
1. Does it improve performance?
2. Can it beat non-preemptive solutions?
3. Do we really need coordination?
YES, to all three.
Evaluation: a 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment
Better than Per-Flow Fairness
• EC2: 1.85X comm. improvement, 1.25X job improvement
• Sim.: 3.21X comm. improvement, 1.11X job improvement
For comm.-heavy jobs:
• EC2: 3.16X comm. improvement, 2.50X job improvement
• Sim.: 4.86X comm. improvement, 3.39X job improvement
Preemption is Necessary [Sim.]
[Figure: average comp. time normalized to Varys: Varys 1.00, Fair1 3.21, FIFO 5.65, Priority2,3 5.53, FIFO-LM4 22.07, NC 1.10]
What about starvation? NO.
Lack of Coordination Hurts [Sim.]
Smallest-flow-first (per-flow priorities2,3) minimizes flow completion time
FIFO-LM4 performs decentralized coflow scheduling:
• Suffers due to local decisions
• Works well for small, similar coflows
[Figure: same chart as above: Varys 1.00, Fair1 3.21, FIFO 5.65, Priority2,3 5.53, FIFO-LM4 22.07, NC 1.10]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014.
Coflow: a communication abstraction for data-parallel applications to express their performance goals. Complete knowledge of (1) the size of each flow, (2) the total number of flows, and (3) the endpoints of individual flows is undermined in practice by:
! Pipelining between stages
! Speculative executions
! Task failures and restarts
How to Perform Coflow Scheduling Without Complete Knowledge?
1. Current size is a good predictor of actual size
2. Set a priority that decreases with how much a coflow has already sent
3. Discretize priority levels to blend in FIFO within each level
Aalo1 efficiently schedules coflows without complete and future information (see the sketch below)
1. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM'2015
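A sketch of that discretized prioritization: multi-level queues keyed on bytes already sent, served highest-priority first and FIFO within a queue. The thresholds and tie-breaking here are illustrative, not Aalo's exact constants.

```python
THRESHOLDS = [10e6, 100e6, 1e9, 10e9]    # bytes sent before demotion

def queue_of(bytes_sent):
    """Lower queue index = higher priority."""
    for q, limit in enumerate(THRESHOLDS):
        if bytes_sent < limit:
            return q
    return len(THRESHOLDS)               # lowest-priority queue

def pick_next(coflows):
    """coflows: cid -> (bytes_sent, arrival_time). Serve the highest-
    priority non-empty queue; earliest arrival wins within a queue."""
    return min(coflows,
               key=lambda c: (queue_of(coflows[c][0]), coflows[c][1]))

print(pick_next({"C1": (5e9, 1.0), "C2": (20e6, 2.0), "C3": (50e6, 0.5)}))
# -> 'C3': C2 and C3 share the second queue, and C3 arrived first.
```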
How to Perform Coflow Scheduling Without Changing the Applications?
1. Learn coflows online from traffic patterns
2. Error-tolerant scheduling to survive learning errors
3. Limited to jobs with single coflows
CODA1 efficiently schedules coflows without changing applications*
1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM'2016
What About Fair Coflow Scheduling?
1. Multi-resource fairness with high utilization
2. The fairness-utilization tradeoff results in a prisoner's dilemma
HUG1 fairly schedules coflows instead of trying to minimize CCT
1. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI'2016
[Figure: "mind the gap" between networking and systems research]
Better capture application-level performance goals using coflows
Coflows improve application-level performance and usability
• Extends the networking and scheduling literature
Coordination – even if not free – is worth paying for in many cases
[email protected]
http://www.mosharaf.com/
Improve Flow Completion Times
[Figure: the same shuffle from s1, s2, s3 to r1, r2]
Per-Flow Fair Sharing: Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66
Smallest-Flow First1,2: flows finish at times 1, 1, 2, 2, 4, and 6; Shuffle Completion Time = 6, Avg. Flow Completion Time = 2.66
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
Distributions of Coflow Characteristics
[Figure: four CDFs (fraction of coflows) of coflow length (bytes), coflow width (number of flows), coflow size (bytes), and coflow bottleneck size (bytes); lengths, sizes, and bottlenecks span roughly 10^6 to 10^14 bytes, widths span 1 to 10^8 flows]
Traffic Sources (percentage of traffic by category at Facebook):
1. Ingest and replicate new data (30%)
2. Read input from remote machines, when needed (14%)
3. Transfer intermediate data (46%)
4. Write and replicate output (10%)
Distribution of Shuffle Durations
Performance: Facebook jobs spend ~25% of runtime on average in intermediate communication
Month-long trace from a 3000-machine MapReduce production cluster at Facebook: 320,000 jobs, 150 million tasks
[Figure: CDF of the fraction of job runtime spent in shuffle]
Theoretical Results
Structure of optimal schedules:
• Permutation schedules might not always lead to the optimal solution
Approximation ratio of COSS-CR:
• Polynomial-time algorithm with a constant approximation ratio (64/3)1
The need for coordination:
• Fully decentralized schedulers can perform arbitrarily worse than the optimal
1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014
Varys employs a two-step algorithm to support coflow deadlines (a sketch of the admission check follows):
1. Admission control: do not admit any coflow that cannot complete within its deadline without violating existing deadlines
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it exactly at its deadline
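A sketch of that admission check under a deliberately simplified model: admitted coflows reserve bytes on each port they touch, ports have unit capacity, and a newcomer is admitted only if draining its most loaded port, behind the bytes already promised, still meets its deadline. The names and the reservation scheme are illustrative.

```python
def min_completion_time(new_load, reserved):
    """Earliest finish for the new coflow given existing reservations."""
    return max(size + reserved.get(p, 0.0) for p, size in new_load.items())

def admit(new_load, deadline, reserved):
    if min_completion_time(new_load, reserved) > deadline:
        return False                      # would miss its own deadline
    # Reserve just enough capacity to finish at the deadline (cf. MADD),
    # leaving headroom for coflows that arrive later.
    for p, size in new_load.items():
        reserved[p] = reserved.get(p, 0.0) + size
    return True

reserved = {}
print(admit({"s3": 4, "r1": 4}, deadline=5, reserved=reserved))  # True
print(admit({"r1": 3}, deadline=4, reserved=reserved))           # False: r1 busy
```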
More Predictable
[Figure: percentage of coflows that met their deadline, were not admitted, or missed their deadline; the EC2 deployment compares Varys, per-flow fairness (Fair), and Earliest-Deadline First (EDF1); the Facebook trace simulation compares Varys and Fair]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012
Experimental Methodology
Varys deployment in EC2:
• 100 m2.4xlarge machines
• Each machine has 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC
• ~900 Mbps/machine during all-to-all communication
Trace-driven simulation:
• Detailed replay of a day-long Facebook trace (circa October 2010)
• 3000-machine, 150-rack cluster with 10:1 oversubscription