Coflow: Recent Advances and What's Next?
Mosharaf Chowdhury
Big Data
The volume of data businesses want to make sense of is increasing
Increasing variety of sources: Web, mobile, wearables, vehicles, scientific, …
Cheaper disks, SSDs, and memory
Stalling processor speeds
Big Datacenters for Massive Parallelism
[Timeline, 2005–2015: MapReduce, Hadoop, Dryad, Hive, DryadLINQ, Pregel, Storm, Dremel, GraphLab, Spark, GraphX, Spark-Streaming, BlinkDB, TensorFlow]
Data-Parallel Applications
Multi-stage dataflow: computation interleaved with communication
Computation stage (e.g., Map, Reduce): distributed across many machines; tasks run in parallel
Communication stage (e.g., Shuffle): between successive computation stages, such as between the Map stage and the Reduce stage
A communication stage cannot complete until all the data have been transferred
Communication is Crucial
Performance
As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck
Facebook jobs spend ~25% of runtime on average in intermediate communication1
1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Faster Communication Stages: The Networking Approach
"Configuration should be handled at the system level"
Flow: transfers data from a source to a destination; the independent unit of allocation, sharing, load balancing, and/or prioritization
Existing Solutions
[Timeline, 1980s to 2015: solutions for per-flow fairness (e.g., GPS, WFQ, RED, ECN, CSFQ, XCP, RCP, DCTCP) followed by solutions for flow completion time (e.g., D3, DeTail, D2TCP, PDQ, FCP, pFabric)]
Independent flows cannot capture the collective communication behavior common in data-parallel applications
Why Do They Fall Short?
[Figure: a shuffle from senders s1, s2, s3 to receivers r1, r2 across a datacenter network with three input links and three output links]
[Figure: per-flow fair sharing timeline over the links to r1 and r2; flow completion times are 3, 3, and 5 on each link, so Shuffle Completion Time = 5 and Avg. Flow Completion Time = 3.66]
Solutions focusing on flow completion time cannot further decrease the shuffle completion time
Improve Application-Level Performance1
[Figure: the same shuffle under per-flow fair sharing: Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
Slow down faster flows to accelerate slower flows
[Figure: Data-Proportional Allocation timeline; all six flows finish at time 4, so Shuffle Completion Time = 4 and Avg. Flow Completion Time = 4]
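To make the arithmetic concrete, here is a toy fluid simulation of this example. It is a sketch under inferred sizes: the slide's numbers are consistent with s1 and s2 each sending 1 unit to both receivers and s3 sending 2 units to each, over unit-capacity links.

```python
# Toy fluid model of the shuffle above. Unit-capacity links; sizes inferred
# from the slide: s1 and s2 send 1 unit to each receiver, s3 sends 2 units.
flows = {("s1", "r1"): 1.0, ("s1", "r2"): 1.0,
         ("s2", "r1"): 1.0, ("s2", "r2"): 1.0,
         ("s3", "r1"): 2.0, ("s3", "r2"): 2.0}

def fair_rates(remaining):
    """Per-flow fair sharing: each flow gets the smaller of its equal shares
    on its sender and receiver links (a simplification of max-min fairness
    that happens to be exact for this example)."""
    rates = {}
    for (src, dst) in remaining:
        n_src = sum(1 for f in remaining if f[0] == src)
        n_dst = sum(1 for f in remaining if f[1] == dst)
        rates[(src, dst)] = min(1.0 / n_src, 1.0 / n_dst)
    return rates

def simulate(flows):
    remaining, now, fct = dict(flows), 0.0, {}
    while remaining:
        rates = fair_rates(remaining)
        # Advance time until the next flow completes.
        step = min(size / rates[f] for f, size in remaining.items())
        now += step
        for f in list(remaining):
            remaining[f] -= rates[f] * step
            if remaining[f] <= 1e-9:
                del remaining[f]
                fct[f] = now
    return fct

fct = simulate(flows)
print(sorted(fct.values()))                    # [3.0, 3.0, 3.0, 3.0, 5.0, 5.0]
print(round(sum(fct.values()) / len(fct), 2))  # 3.67 -> shuffle ends at 5

# Data-proportional allocation instead matches every flow's rate to its share
# of the bottleneck (4 units each on s3, r1, and r2): rates 1/4, 1/4, 1/2 on
# each receiver link, so all six flows finish together at t = 4.
```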
Coflow: a communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.
[Figure: common coflow structures: single flow, parallel flows, shuffle, broadcast, aggregation, and all-to-all]
How to schedule coflows online…
#1 …for faster completion of coflows?
#2 …to meet more deadlines?
#3 …for fair allocation of the network?
[Figure: the datacenter fabric abstracted as one big switch with N input and N output ports]
Varys1: enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
Coflow: a communication abstraction for data-parallel applications to express their performance goals. A clairvoyant scheduler knows, for each coflow:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
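A minimal sketch (illustrative types, not the Varys implementation) of this per-coflow bookkeeping; it also derives the length, width, size, and bottleneck metrics that the ordering heuristics later on compare.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Flow:
    src: str        # sender endpoint
    dst: str        # receiver endpoint
    size: float     # bytes to transfer, known up front

@dataclass
class Coflow:
    cid: str
    flows: List[Flow] = field(default_factory=list)

    def width(self) -> int:
        """Total number of flows."""
        return len(self.flows)

    def size(self) -> float:
        """Sum of all flow sizes."""
        return sum(f.size for f in self.flows)

    def length(self) -> float:
        """Size of the largest single flow."""
        return max(f.size for f in self.flows)

    def bottleneck(self) -> float:
        """Load on the coflow's most congested endpoint."""
        load: Dict[str, float] = {}
        for f in self.flows:
            load[f.src] = load.get(f.src, 0.0) + f.size
            load[f.dst] = load.get(f.dst, 0.0) + f.size
        return max(load.values())
```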
Benefits of Inter-Coflow Scheduling
[Figure: two links; Coflow 1 has a 3-unit flow on Link 1, while Coflow 2 has a 2-unit flow on Link 1 and a 6-unit flow on Link 2]
• Fair Sharing: Coflow1 comp. time = 5, Coflow2 comp. time = 6
• Smallest-Flow First1,2 (flow-level prioritization): Coflow1 comp. time = 5, Coflow2 comp. time = 6
• The Optimal: Coflow1 comp. time = 3, Coflow2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
Inter-Coflow Scheduling is NP-Hard
Equivalent to Concurrent Open Shop Scheduling1:
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
1. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.

Over a datacenter fabric, it becomes Concurrent Open Shop Scheduling with Coupled Resources, and it remains NP-Hard:
• Each flow occupies both an input link and an output link, so solutions must additionally consider matching constraints
[Figure: Coflow 1 (3 units) and Coflow 2 (6 and 2 units) mapped onto the input and output links of a 3×3 datacenter fabric]
Varys employs a two-step algorithm to minimize coflow completion times:
1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish in minimum time
Ordering Heuristic
[Figure: three coflows over a 3×3 fabric with per-port demands of 3, 3, 3 (C1), 5, 5 (C2), and 6 (C3); four orderings compared on the same timeline]
• Shortest-First (length: C1 = 3, C2 = 5, C3 = 6): C1 ends at 3, C2 at 13, C3 at 19 (Total CCT = 35)
• Narrowest-First (width: C1 = 3, C2 = 2, C3 = 1): C3 ends at 6, C2 at 16, C1 at 19 (Total CCT = 41)
• Smallest-First (size: C1 = 9, C2 = 10, C3 = 6): C3 ends at 6, C1 at 9, C2 at 19 (Total CCT = 34)
• Smallest-Bottleneck (bottleneck: C1 = 3, C2 = 10, C3 = 6): C1 ends at 3, C3 at 9, C2 at 19 (Total CCT = 31)
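The ordering step itself reduces to a sort. A sketch using the example's metric values (the total CCTs in the comments are taken from the timelines above):

```python
# Coflow metrics from the example above; Varys sorts by the bottleneck.
coflows = {
    "C1": {"length": 3, "width": 3, "size": 9,  "bottleneck": 3},
    "C2": {"length": 5, "width": 2, "size": 10, "bottleneck": 10},
    "C3": {"length": 6, "width": 1, "size": 6,  "bottleneck": 6},
}

def order(metric):
    """Schedule coflows in ascending order of the chosen metric."""
    return sorted(coflows, key=lambda c: coflows[c][metric])

print(order("length"))      # ['C1', 'C2', 'C3'] -> Total CCT = 35
print(order("width"))       # ['C3', 'C2', 'C1'] -> Total CCT = 41
print(order("size"))        # ['C3', 'C1', 'C2'] -> Total CCT = 34
print(order("bottleneck"))  # ['C1', 'C3', 'C2'] -> Total CCT = 31
```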
Allocation Algorithm
A coflow cannot finish before its very last flow
Finishing flows faster than the bottleneck cannot decrease a coflow’s completion time
Allocate minimum flow rates such that all flows of a coflow finish together on time
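A sketch of this allocation step, which the Varys paper calls MADD (Minimum Allocation for Desired Duration), assuming unit-capacity ports:

```python
def madd(flows, capacity=1.0):
    """flows: (src, dst) -> remaining bytes. Returns the coflow's minimum
    completion time and the slowest per-flow rates that still achieve it."""
    load = {}
    for (src, dst), size in flows.items():
        load[src] = load.get(src, 0.0) + size
        load[dst] = load.get(dst, 0.0) + size
    duration = max(load.values()) / capacity   # bottleneck port drain time
    # Running any flow faster than size/duration finishes it early without
    # finishing the coflow earlier; the freed bandwidth goes to coflows
    # later in the order.
    return duration, {f: size / duration for f, size in flows.items()}

duration, rates = madd({("s1", "r1"): 1, ("s1", "r2"): 1,
                        ("s2", "r1"): 1, ("s2", "r2"): 1,
                        ("s3", "r1"): 2, ("s3", "r2"): 2})
print(duration, rates)   # 4.0: every flow finishes together at t = 4
```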
Varys: enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Need for Coordination
[Figure: coflows C1 and C2 with bottlenecks of 4 and 5 units spread across the fabric ports]
Scheduling with coordination: C1 ends at 4, C2 ends at 9 (Total CCT = 13)
Scheduling without coordination: C1 ends at 7, C2 ends at 12 (Total CCT = 19)
Uncoordinated local decisions interleave coflows, hurting performance
Varys Architecture
Centralized master-slave architecture:
• Applications use a client library to communicate with the master
• Actual timing and rates are determined by the coflow scheduler
[Figure: computation tasks (sender, receiver, driver) call the Varys client library, which issues put/get/reg requests to the Varys Master; the master contains the Coflow Scheduler with a Topology Monitor and a Usage Estimator, talks to the (distributed) file system, and enforces decisions through Varys Daemons over the network interface]
1. Download from http://varys.net
Varys: enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Coflow API
• register
• put
• get
• unregister

Usage in a MapReduce-like job with a broadcast coflow b and a shuffle coflow s:
@driver: b = register(BROADCAST); id = b.put(content); s = register(SHUFFLE)
@mapper: b.get(id); …; ids1 = s.put(content)
@reducer: s.get(ids1); …
@driver: s.unregister(); b.unregister()

1. NO changes to user jobs
2. NO storage management
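A runnable sketch of this four-call pattern. The in-memory stub below is purely illustrative: the real Varys client library registers coflows with the Varys master and moves data over the network rather than through a local dict.

```python
import uuid

class CoflowStub:
    """Stand-in for a Varys client handle; stores blobs in memory."""
    def __init__(self, kind):
        self.kind = kind
        self._store = {}

    def put(self, content):
        """Register `content`; the returned id is shared via the driver."""
        data_id = str(uuid.uuid4())
        self._store[data_id] = content
        return data_id

    def get(self, data_id):
        """Fetch previously put content (over the network in real Varys)."""
        return self._store[data_id]

    def unregister(self):
        self._store.clear()

def register(kind):
    return CoflowStub(kind)

# @driver: one coflow per communication stage.
b, s = register("BROADCAST"), register("SHUFFLE")
conf_id = b.put({"reducers": 2})

# @mapper: read the broadcast, write shuffle partitions.
conf = b.get(conf_id)
ids = [s.put(f"partition-{i}") for i in range(conf["reducers"])]

# @reducer: fetch this reducer's partition.
print(s.get(ids[0]))

# @driver: tear down once the stages finish.
s.unregister(); b.unregister()
```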
1. Does it improve performance?
2. Can it beat non-preemptive solutions?
3. Do we really need coordination?
YES, to all three.
Evaluation: a 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment
Better than Per-Flow Fairness
• EC2: 1.85X comm. improvement, 1.25X job improvement
• Sim.: 3.21X comm. improvement, 1.11X job improvement
For comm.-heavy jobs:
• EC2: 3.16X comm. improvement, 2.50X job improvement
• Sim.: 4.86X comm. improvement, 3.39X job improvement
Preemption is Necessary [Sim.]
[Figure: average comp. time normalized to Varys: Varys 1.00, Fair1 3.21, FIFO 5.65, Priority2,3 5.53, FIFO-LM4 22.07, NC 1.10]
What about starvation? NO.
Lack of Coordination Hurts [Sim.]
Smallest-flow-first (per-flow priorities2,3) minimizes flow completion time
FIFO-LM4 performs decentralized coflow scheduling:
• Suffers due to local decisions
• Works well for small, similar coflows
[Figure: same chart as above: Varys 1.00, Fair1 3.21, FIFO 5.65, Priority2,3 5.53, FIFO-LM4 22.07, NC 1.10]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014.
Coflow: a communication abstraction for data-parallel applications to express their performance goals. Complete knowledge of (1) the size of each flow, (2) the total number of flows, and (3) the endpoints of individual flows is undermined in practice by:
! Pipelining between stages
! Speculative executions
! Task failures and restarts
How to Perform Coflow Scheduling Without Complete Knowledge?
1. Current size is a good predictor of actual size
2. Set a priority that decreases with how much a coflow has already sent
3. Discretize priority levels to blend in FIFO within each level
Aalo1 efficiently schedules coflows without complete and future information (see the sketch below)
1. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM'2015
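A sketch of that discretized prioritization: multi-level queues keyed on bytes already sent, served highest-priority first and FIFO within a queue. The thresholds and tie-breaking here are illustrative, not Aalo's exact constants.

```python
THRESHOLDS = [10e6, 100e6, 1e9, 10e9]    # bytes sent before demotion

def queue_of(bytes_sent):
    """Lower queue index = higher priority."""
    for q, limit in enumerate(THRESHOLDS):
        if bytes_sent < limit:
            return q
    return len(THRESHOLDS)               # lowest-priority queue

def pick_next(coflows):
    """coflows: cid -> (bytes_sent, arrival_time). Serve the highest-
    priority non-empty queue; earliest arrival wins within a queue."""
    return min(coflows,
               key=lambda c: (queue_of(coflows[c][0]), coflows[c][1]))

print(pick_next({"C1": (5e9, 1.0), "C2": (20e6, 2.0), "C3": (50e6, 0.5)}))
# -> 'C3': C2 and C3 share the second queue, and C3 arrived first.
```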
How to Perform Coflow Scheduling Without Changing the Applications?
1. Learn coflows online from traffic patterns
2. Error-tolerant scheduling to survive learning errors
3. Limited to jobs with single coflows
CODA1 efficiently schedules coflows without changing applications*
1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM'2016
What About Fair Coflow Scheduling?
1. Multi-resource fairness with high utilization
2. The fairness-utilization tradeoff results in a prisoner's dilemma
HUG1 fairly schedules coflows instead of trying to minimize CCT
1. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI'2016
[Figure: "mind the gap" between networking and systems research]
Better capture application-level performance goals using coflows
Coflows improve application-level performance and usability
• Extends the networking and scheduling literature
Coordination – even if not free – is worth paying for in many cases
[email protected]
http://www.mosharaf.com/
Improve Flow Completion Times
[Figure: the same shuffle from s1, s2, s3 to r1, r2]
Per-Flow Fair Sharing: Shuffle Completion Time = 5, Avg. Flow Completion Time = 3.66
Smallest-Flow First1,2: flows finish at times 1, 1, 2, 2, 4, and 6; Shuffle Completion Time = 6, Avg. Flow Completion Time = 2.66
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
Distributions of Coflow Characteristics
[Figure: four CDFs (fraction of coflows) of coflow length (bytes), coflow width (number of flows), coflow size (bytes), and coflow bottleneck size (bytes); lengths, sizes, and bottlenecks span roughly 10^6 to 10^14 bytes, widths span 1 to 10^8 flows]
Traffic Sources (percentage of traffic by category at Facebook):
1. Ingest and replicate new data (30%)
2. Read input from remote machines, when needed (14%)
3. Transfer intermediate data (46%)
4. Write and replicate output (10%)
Distribution of Shuffle Durations
Performance: Facebook jobs spend ~25% of runtime on average in intermediate communication
Month-long trace from a 3000-machine MapReduce production cluster at Facebook: 320,000 jobs, 150 million tasks
[Figure: CDF of the fraction of job runtime spent in shuffle]
Theoretical Results
Structure of optimal schedules:
• Permutation schedules might not always lead to the optimal solution
Approximation ratio of COSS-CR:
• Polynomial-time algorithm with a constant approximation ratio (64/3)1
The need for coordination:
• Fully decentralized schedulers can perform arbitrarily worse than the optimal
1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 2014
Varys employs a two-step algorithm to support coflow deadlines (a sketch of the admission check follows):
1. Admission control: do not admit any coflow that cannot complete within its deadline without violating existing deadlines
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it exactly at its deadline
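A sketch of that admission check under a deliberately simplified model: admitted coflows reserve bytes on each port they touch, ports have unit capacity, and a newcomer is admitted only if draining its most loaded port, behind the bytes already promised, still meets its deadline. The names and the reservation scheme are illustrative.

```python
def min_completion_time(new_load, reserved):
    """Earliest finish for the new coflow given existing reservations."""
    return max(size + reserved.get(p, 0.0) for p, size in new_load.items())

def admit(new_load, deadline, reserved):
    if min_completion_time(new_load, reserved) > deadline:
        return False                      # would miss its own deadline
    # Reserve just enough capacity to finish at the deadline (cf. MADD),
    # leaving headroom for coflows that arrive later.
    for p, size in new_load.items():
        reserved[p] = reserved.get(p, 0.0) + size
    return True

reserved = {}
print(admit({"s3": 4, "r1": 4}, deadline=5, reserved=reserved))  # True
print(admit({"r1": 3}, deadline=4, reserved=reserved))           # False: r1 busy
```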
More Predictable
[Figure: percentage of coflows that met their deadline, were not admitted, or missed their deadline; the EC2 deployment compares Varys, per-flow fairness (Fair), and Earliest-Deadline First (EDF1); the Facebook trace simulation compares Varys and Fair]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012
Experimental Methodology
Varys deployment in EC2:
• 100 m2.4xlarge machines
• Each machine has 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC
• ~900 Mbps/machine during all-to-all communication
Trace-driven simulation:
• Detailed replay of a day-long Facebook trace (circa October 2010)
• 3000-machine, 150-rack cluster with 10:1 oversubscription