Recent Advances and What’s Next? Coflow Mosharaf Chowdhury University of Michigan

Page 1: Recent Advances and What’s Next? - DIMACSdimacs.rutgers.edu/Workshops/DataCenterNetworks/Sli… ·  · 2017-06-20Recent Advances and What’s Next? Coflow Mosharaf Chowdhury ...

Recent Advances and What’s Next? Coflow

Mosharaf Chowdhury

University of Michigan

Page 2:

Datacenter-Scale Computing

Geo-Distributed Computing

Fast Analytics Over the WAN

Rack-Scale Computing

Proactive Analytics Before You Think!

Coflow Networking Open Source

Apache Spark Open Source

Cluster File System Facebook

Resource Allocation Microsoft

DAG Scheduling Apache YARN

Cluster Caching Alluxio

Page 3:

Rack-Scale Computing: < 0.01 ms

Datacenter-Scale Computing: ~ 1 ms

Geo-Distributed Computing: > 100 ms

Page 4:

Big Data

The volume of data businesses want to make sense of is increasing

Increasing variety of sources
• Web, mobile, wearables, vehicles, scientific, …

Cheaper disks, SSDs, and memory

Stalling processor speeds

Page 5:

Big Datacenters for Massive Parallelism

[Figure: timeline from 2005 to 2015 of data-parallel systems: MapReduce, Hadoop, Spark¹, Hive, Dryad, DryadLINQ, Spark-Streaming, GraphX, GraphLab, Pregel, Storm, Dremel, BlinkDB]

1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI’2012.

Page 6:

Distributed Data-Parallel Applications

Multi-stage dataflow
• Computation interleaved with communication

Computation Stage (e.g., Map, Reduce)
• Distributed across many machines
• Tasks run in parallel

Communication Stage (e.g., Shuffle)
• Between successive computation stages

[Figure: a Map Stage followed by a Reduce Stage]

A communication stage cannot complete until all the data have been transferred

Page 7:

Communication is Crucial

Performance

Facebook jobs spend ~25% of their runtime, on average, in intermediate communication.¹

As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck

1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.

Page 8:

Faster Communication Stages: Traditional Networking Approach

Flow
• Transfers data from a source to a destination
• Independent unit of allocation, sharing, load balancing, and/or prioritization

Page 9:

Existing Solutions

[Figure: timeline from the 1980s to 2015 of flow-level solutions (GPS, RED, WFQ, CSFQ, ECN, XCP, RCP, DCTCP, D3, D2TCP, PDQ, FCP, DeTail, pFabric), grouped under two objectives: Per-Flow Fairness and Flow Completion Time]

Independent flows cannot capture the collective communication behavior common in data-parallel applications

Page 10:

Why Do They Fall Short?

[Figure: a shuffle from senders s1, s2, s3 to receivers r1, r2, mapped onto a datacenter fabric with three input links and three output links]

Page 11:

Why Do They Fall Short?

[Figure: the same shuffle (senders s1, s2, s3; receivers r1, r2) with its endpoints placed on the ports of the datacenter fabric]

Page 12:

Why Do They Fall Short?

[Figure: flow timelines on the links to r1 and r2 across the datacenter fabric; on each link the three flows finish at times 3, 3, and 5]

Per-Flow Fair Sharing
• Shuffle completion time = 5
• Avg. flow completion time = 3.66

Solutions focusing on flow completion time cannot further decrease the shuffle completion time
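The numbers on this slide can be reproduced with a small fluid simulation of max-min fair sharing. This is only a sketch under assumptions that the slide does not state: every sender and receiver port has unit capacity, the fabric itself is non-blocking, and the flow sizes (s1 and s2 send 1 unit to each receiver, s3 sends 2 units to each) are chosen so that the result matches the completion times shown. All function names are illustrative.

```python
from fractions import Fraction

# Assumed flow sizes, (sender, receiver) -> data units; not given on the slide.
SIZES = {
    ("s1", "r1"): 1, ("s2", "r1"): 1, ("s3", "r1"): 2,
    ("s1", "r2"): 1, ("s2", "r2"): 1, ("s3", "r2"): 2,
}


def links_of(flow):
    """A flow uses its sender's uplink and its receiver's downlink."""
    src, dst = flow
    return (("up", src), ("down", dst))


def max_min_rates(active):
    """Progressive filling over unit-capacity links (max-min fairness)."""
    capacity = {link: Fraction(1) for f in active for link in links_of(f)}
    unfrozen, rates = set(active), {}
    while unfrozen:
        # Fair share each link can still offer to its unfrozen flows.
        share = {}
        for link, cap in capacity.items():
            users = [f for f in unfrozen if link in links_of(f)]
            if users:
                share[link] = cap / len(users)
        bottleneck = min(share, key=share.get)
        rate = share[bottleneck]
        # Freeze every flow crossing the tightest link at that share.
        for f in [f for f in unfrozen if bottleneck in links_of(f)]:
            rates[f] = rate
            unfrozen.remove(f)
            for link in links_of(f):
                capacity[link] -= rate
    return rates


def fair_sharing_completion_times(sizes):
    """Advance time from one flow completion to the next."""
    remaining = {f: Fraction(s) for f, s in sizes.items()}
    now, finish = Fraction(0), {}
    while remaining:
        rates = max_min_rates(remaining)
        step = min(remaining[f] / rates[f] for f in remaining)
        now += step
        for f in list(remaining):
            remaining[f] -= rates[f] * step
            if remaining[f] == 0:
                finish[f] = now
                del remaining[f]
    return finish


finish = fair_sharing_completion_times(SIZES)
shuffle_completion = max(finish.values())        # the shuffle ends with its last flow
avg_flow_completion = sum(finish.values()) / len(finish)
print(shuffle_completion, float(avg_flow_completion))
```

Under these assumptions the four small flows finish at time 3 and the two flows from s3 finish at time 5, so the shuffle takes 5 time units while the average flow completion time is 11/3, which the slide shows as 3.66.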

Page 13:

Improve Application-Level Performance¹

Slow down faster flows to accelerate slower flows

[Figure: flow timelines on the links to r1 and r2 under two allocations; with per-flow fair sharing the flows on each link finish at times 3, 3, and 5, and with data-proportional allocation all flows finish at time 4]

Per-Flow Fair Sharing
• Shuffle completion time = 5
• Avg. flow completion time = 3.66

Data-Proportional Allocation
• Shuffle completion time = 4
• Avg. flow completion time = 4

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
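The data-proportional allocation can be checked with a few lines of arithmetic. The sketch below assumes unit-capacity sender and receiver ports and hypothetical flow sizes (s1 and s2 send 1 unit to each receiver, s3 sends 2 units to each) picked to match the slide's numbers; neither is stated on the slide. Each flow receives a rate proportional to its size, so all flows finish together at the time the most heavily loaded port needs to drain its data.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed flow sizes, (sender, receiver) -> data units; not given on the slide.
SIZES = {
    ("s1", "r1"): 1, ("s2", "r1"): 1, ("s3", "r1"): 2,
    ("s1", "r2"): 1, ("s2", "r2"): 1, ("s3", "r2"): 2,
}


def data_proportional(sizes, capacity=Fraction(1)):
    """Return (completion time, per-flow rates) for size-proportional rates."""
    load = defaultdict(Fraction)  # total data crossing each port
    for (src, dst), size in sizes.items():
        load[("up", src)] += size
        load[("down", dst)] += size
    # The most loaded port dictates the earliest possible completion time.
    completion = max(load.values()) / capacity
    rates = {flow: Fraction(size) / completion for flow, size in sizes.items()}
    return completion, rates


completion, rates = data_proportional(SIZES)
print(completion, rates[("s1", "r1")], rates[("s3", "r1")])
```

With these sizes the busiest ports (s3, r1, and r2) each carry 4 units, so every flow finishes at time 4: the small flows are slowed to rate 1/4 so that the flows from s3 can run at rate 1/2.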

Page 14:

Coflow

Communication abstraction for data-parallel applications to express their performance goals

1. Size of each flow
2. Total number of flows
3. Endpoints of individual flows
4. Dependencies between coflows
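The four attributes listed above map naturally onto a small data structure. The following is a minimal sketch with hypothetical class and field names (`Flow`, `Coflow`, `depends_on`); it is not the interface of Varys, Aalo, or any other coflow system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    src: str     # sending endpoint
    dst: str     # receiving endpoint
    size: float  # data units to transfer (attribute 1)


@dataclass
class Coflow:
    name: str
    flows: list = field(default_factory=list)       # the individual Flow objects
    depends_on: list = field(default_factory=list)  # names of coflows that must finish first (attribute 4)

    @property
    def num_flows(self):
        """Total number of flows (attribute 2)."""
        return len(self.flows)

    @property
    def endpoints(self):
        """(source, destination) pairs of the individual flows (attribute 3)."""
        return {(f.src, f.dst) for f in self.flows}

    @property
    def total_size(self):
        return sum(f.size for f in self.flows)


# Hypothetical example: a 3x2 shuffle followed by an aggregation that depends on it.
shuffle = Coflow(
    "shuffle",
    [Flow(s, r, 1.0) for s in ("s1", "s2", "s3") for r in ("r1", "r2")],
)
aggregate = Coflow(
    "aggregate",
    [Flow("r1", "driver", 2.0), Flow("r2", "driver", 2.0)],
    depends_on=["shuffle"],
)
print(shuffle.num_flows, shuffle.total_size, aggregate.depends_on)
```

A scheduler that knows all four attributes up front is clairvoyant; one that only knows which flows belong together would have to work with fewer of these fields filled in.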

Page 15:

[Figure: example communication patterns: Single Flow, Parallel Flows, Broadcast, Aggregation, Shuffle, All-to-All]

Page 16:

How to schedule coflows online …

#1 … for faster completion of coflows?

#2 … to meet more deadlines?

#3 … for fair allocation of the network?

[Figure: a datacenter fabric with input ports 1 to N and output ports 1 to N]

Page 17:

Varys¹, Aalo² & HUG³

1. Coflow Scheduler: Faster, application-aware data transfers throughout the network

2. Global Coordination: Consistent calculation and enforcement of scheduler decisions

3. The Coflow API: Decouples network optimizations from applications, relieving developers and end users

1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
2. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM’2015.
3. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI’2016.

Page 18:

Benefits of Inter-Coflow Scheduling

[Figure: two links, L1 and L2; Coflow 1 has a 3-unit flow on Link 1, and Coflow 2 has a 2-unit flow on Link 1 and a 6-unit flow on Link 2; three timelines (time axis 2, 4, 6) show the resulting schedules]

Fair Sharing: Coflow1 comp. time = 5, Coflow2 comp. time = 6

Smallest-Flow First¹,²: Coflow1 comp. time = 5, Coflow2 comp. time = 6

The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6

1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
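The three schedules on this slide can be recomputed with a short script. It is a sketch under assumptions: both links have unit capacity, and the placement of flows on links (Coflow 1 with 3 units on Link 1; Coflow 2 with 2 units on Link 1 and 6 units on Link 2) is inferred from the completion times shown rather than stated. The coflow-aware order used here, smallest bottleneck first, is one ordering heuristic that happens to reach the slide's optimal numbers on this example; it is not claimed to be optimal in general.

```python
from fractions import Fraction

# (coflow, link, size); placement inferred from the slide's completion times.
FLOWS = [("C1", "L1", 3), ("C2", "L1", 2), ("C2", "L2", 6)]


def coflow_completion(flow_finish):
    """A coflow completes when its last flow finishes."""
    cct = {}
    for (coflow, _link), t in flow_finish.items():
        cct[coflow] = max(cct.get(coflow, 0), t)
    return cct


def fair_sharing(flows):
    """Equal sharing of each unit-capacity link among its unfinished flows.

    Assumes at most one flow per coflow on each link.
    """
    finish = {}
    for link in sorted({l for _, l, _ in flows}):
        on_link = sorted((size, c) for c, l, size in flows if l == link)
        now, served = Fraction(0), Fraction(0)
        for i, (size, coflow) in enumerate(on_link):
            sharers = len(on_link) - i          # flows still active on this link
            now += (size - served) * sharers    # time until the next-smallest flow ends
            served = Fraction(size)
            finish[(coflow, link)] = now
    return coflow_completion(finish)


def one_at_a_time(flows, key):
    """Serve the flows on each link back to back, in the order given by key."""
    finish = {}
    for link in sorted({l for _, l, _ in flows}):
        now = 0
        for coflow, _, size in sorted((f for f in flows if f[1] == link), key=key):
            now += size
            finish[(coflow, link)] = now
    return coflow_completion(finish)


def bottleneck(flows, coflow):
    """Largest amount of data the coflow places on any single link."""
    per_link = {}
    for c, link, size in flows:
        if c == coflow:
            per_link[link] = per_link.get(link, 0) + size
    return max(per_link.values())


fair = fair_sharing(FLOWS)
smallest_flow_first = one_at_a_time(FLOWS, key=lambda f: f[2])
coflow_aware = one_at_a_time(FLOWS, key=lambda f: bottleneck(FLOWS, f[0]))
print(fair, smallest_flow_first, coflow_aware)
```

Both flow-level policies let Coflow 2's short flow on Link 1 delay Coflow 1 until time 5, even though Coflow 2 cannot finish before time 6 because of its 6-unit flow on Link 2. Ordering by coflow lets Coflow 1 finish at time 3 without hurting Coflow 2.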

Page 19:

Inter-Coflow Scheduling is NP-Hard

[Figure: the two-link example (Coflow 1: 3 units; Coflow 2: 6 units and 2 units) redrawn on a datacenter fabric with three input links and three output links]

Concurrent Open Shop Scheduling with Coupled Resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
• Consider matching constraints

Page 20:

Many Problems to Solve

Aalo

Varys

Clairvoyant Objective

HUG

Min CCT

Min CCT

Fair CCT

Yes

No

No

Optimal

Yes

No

No

Page 21:

Coflow-Based Architecture

Centralized master-slave architecture
• Applications use a client library to communicate with the master

Actual timing and rates are determined by the coflow scheduler

[Figure: a Master/Coordinator running the Coflow Scheduler coordinates with a Local Daemon on each machine; computation tasks send their flows (f) through the network interface]

Page 22:

Coflow API

Change the applications
• At the very least, we need to know what a coflow is
• For clairvoyant versions, we need more information

Changing the framework can enable ALL jobs to take advantage of coflows

DO NOT change the applications¹
• Infer coflows from network traffic patterns
• Design a robust coflow scheduler that can tolerate misestimations

Our current solution only works for coflows without dependencies; we need DAG support!

1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM’2016.

Page 23:

Performance Benefits of Using Coflows

[Figure: bar chart of overhead over Varys; lower is better]
• Varys: 1.00
• Fair (per-flow fairness): 3.21
• FIFO¹: 5.65
• Priority (per-flow prioritization)²,³: 5.53
• FIFO-LM⁴: 22.07
• NC (Aalo): 1.10

1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014.

Page 24:

The Need for Coordination

[Figure: average coordination time (ms, log scale) vs. number of (emulated) Aalo slaves]
• 100 slaves: 8 ms
• 1,000 slaves: 17 ms
• 10,000 slaves: 115 ms
• 50,000 slaves: 495 ms
• 100,000 slaves: 992 ms

Coordination is necessary to determine, in real time:
• Coflow size (sum)
• Coflow rates (max)
• Partial order of coflows (ordering)

Can be a large source of overhead
• Does not impact too much for large coflows in slow networks, but …

How to perform decentralized coflow scheduling?
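To make the three coordinated quantities concrete, here is a minimal sketch of one coordination round. Everything about the interface is an assumption for illustration: the report format, the `coordinate` function, and the choice to rank coflows by smallest aggregated size first are hypothetical and are not Aalo's actual protocol or queueing policy. The sketch only shows why a global view is needed: no single daemon can compute the sum, the max, or the ordering from its local flows alone.

```python
from collections import defaultdict


def coordinate(reports):
    """Aggregate per-daemon reports into global coflow state.

    reports: {daemon_id: {coflow_id: (bytes_sent, local_rate)}}  (assumed format)
    """
    size = defaultdict(int)    # coflow size: sum over all daemons
    rate = defaultdict(float)  # coflow rate: max over all daemons
    for per_coflow in reports.values():
        for coflow, (bytes_sent, local_rate) in per_coflow.items():
            size[coflow] += bytes_sent
            rate[coflow] = max(rate[coflow], local_rate)
    # Illustrative ordering: smallest aggregated size first, ties broken by id.
    order = sorted(size, key=lambda c: (size[c], c))
    return dict(size), dict(rate), order


# Hypothetical reports from three local daemons.
reports = {
    "daemon-1": {"A": (40, 2.0), "B": (10, 1.0)},
    "daemon-2": {"A": (25, 3.5)},
    "daemon-3": {"B": (5, 0.5), "C": (100, 4.0)},
}
size, rate, order = coordinate(reports)
print(size, rate, order)
```

Each round requires every daemon to report and then receive the result, which is where the coordination time in the figure comes from as the number of daemons grows.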

Page 25:

Coflow-Aware Load Balancing

Especially useful in asymmetric topologies
• For example, in the presence of switch or link failures

Provides an additional degree of freedom
• During path selection
• For dynamically determining load balancing granularity

Increased need for coordination, but at an even higher cost

Page 26:

Coflow-Aware Routing

Relevant in topologies without full bisection bandwidth
• When topologies have temporary in-network oversubscriptions
• In geo-distributed analytics

Scheduling-only solutions do not work well
• Calls for routing-scheduling joint solutions
• Must take network utilization into account
• Must avoid frequent path changes

Increased need for coordination

Page 27:

Coflows in Circuit-Switched Networks

Circuit switching is relevant again due to the rise of optical networks
• Provides very high bandwidth
• Expensive to set up new circuits

Co-scheduling applications and coflows
• Schedule tasks so that we can reuse already-setup circuits
• Perform in-network aggregation using existing circuits instead of waiting for new circuits to be created

Page 28:

Extension to Multiple Resources¹

A DAG of coflows is very similar to a job DAG of stages
• Same principle applies, but with new challenges

Consider both fungible (b/w) and non-fungible resources (cores)
• Across the entire DAG

1. Altruistic Scheduling in Multi-Resource Clusters, OSDI’2016.

Page 29:

Coflow

Communication abstraction for data-parallel applications to express their performance goals

Key open challenges
1. Better theoretical understanding
2. Efficient solutions to deal with decentralization, topologies, multi-resource settings, estimations over DAG, circuit-switching, etc.

More information
1. Papers: http://www.mosharaf.com/publications/
2. Software/simulator/workloads: https://github.com/coflow