Optimizing Shuffle in Wide-Area Data Analytics...Wide-Area Data Analytics 3 N. California Ireland...
Transcript of Optimizing Shuffle in Wide-Area Data Analytics...Wide-Area Data Analytics 3 N. California Ireland...
Optimizing Shuffle inWide-Area Data AnalyticsShuhao Liu*, Hao Wang, Baochun Li Department of Electrical & Computer Engineering University of Toronto
What is: - Wide-Area Data Analytics? - Shuffle?
�2
Wide-Area Data Analytics
�3
N. California
Ireland
Singapore
Oregon
N. Virginiadata 1
data 2
data 3
Large volumes of data are generated, stored and processed across geographically distributed DCs.
Existing work focuses on Task Placement
�4
Rethink the root cause of inter-DC traffic: Shuffle
�5
Fetch-based Shuffle
�6
Mapper 1
Mapper 2
Mapper 3
Reducer 1
Reducer 3
Reducer 2
All-to-allCommunicationat Beginning of Reduce Tasks
Problems with Fetch‣ Under-utilize the inter-datacenter bandwidth
‣ Start late: beginning of reduce
‣ Start concurrently: share bandwidth
‣ Need for refetch
‣ Possible reduce task failure
�7
Push-based Shuffle‣ Bandwidth Utilization
�8
time0 4 8 12 16
time0 4 8 12 16
Map Reduce
Reduce
Map Data Transfer
Data Transfer
Reduce
ReduceMap
Shuffle Read
Shuffle Read
ShuffleWrite
ShuffleWrite
Map
Stage N Stage N+1
Stage N Stage N+1
worker A
worker B
worker A
worker B
ShuffleRead
(a)
(b)
Push-based Shuffle‣ Failure Recovery
�9
time0 4 8 12 16 20 24
Map FailedReduce
Reduce
Refetch ReduceShuffle Read
Shuffle Read
ShuffleWrite
time0 4 8 12 16 20 24
Map
Stage N Stage N+1
worker A
worker B(a)
Map FailedReduce
Reduce
Reduce
Map
Stage N Stage N+1
worker A
worker B(b)
Data Transfer
Data Transfer
ShuffleWrite
ShuffleRead
Refetch
Where to Push?‣ Optional: existing task placement algorithms
‣ Know reducer placement before hand
‣ Require prior knowledge
‣ e.g., predictable jobs, inter-DC available bandwidth
‣ Our solution: Push/Aggregate
�10
‣ Send shuffle input to a subset of datacenters with a large portion of shuffle input
‣ Reduce inter-datacenter traffic in future shuffles
‣ Likely to reduce inter-datacenter traffic at current shuffle
Aggregating Shuffle Input
�11
Datacenter AReceiver of shuffle
input
Reducer 1
Reducer 3
Datacenter B
Reducer 2
Reducer 4
Reducer 5
Inter-DC Transfers
For any partition of shuffle input, the expected inter-datacenter traffic in next shuffle is proportional to the number of non-colocated reducers.
�12
Aggregating Shuffle Input‣ Send shuffle input to a subset of datacenters with a
large portion of shuffle input
‣ Reducer is likely to be placed close to shuffle input
‣ More aggregated data -> less inter-datacenter traffic with reasonable task placement
�13
Implementation in Spark‣ Requirements:
‣ Push before writing to disk
‣ Destined to the aggregator datacenters
‣ transferTo() as an RDD transformation
‣ Allow implicit or explicit usage
�14
Implementation in Spark
�15
InputRDD
A1 B1A2
mapA1
mapA2
mapB1
reduceA1, A2
B1
reduceA1, A2
B1
InputRDD .map(…) .reduce(…) …
InputRDD .map(…) .transferTo([A]) .reduce(…) …
InputRDD
A1 B1A2
mapA1
mapA2
mapB1
reduceA1, A2, Ax
transferTo
A1transferTo
A2transferTo
A*
reduceA1, A2, Ax
(a) (b)
Implementation in Spark
�16
5/12/2016 ScalaWordCount - Details for Job 0
http://54.173.130.234:4040/jobs/job/?id=0 1/2
1.6.1(/) ScalaWordCount application UI
(kill)(/stages/stage/kill/?id=0&terminate=true)
Details for Job 0
Status: RUNNINGActive Stages: 1Pending Stages: 1
Stage 0
map
sequenceFile
map
flatMap
Stage 1
map
reduceByKey
Active Stages (1)
Stage
Id Description Submitted Duration
Tasks:
Succeeded/Total Input Output
Shuffle
Read
Shuffle
Write
0 2016/05/1207:01:01
8 s 128.0MB
24.1KB
Jobs (/jobs/) Stages (/stages/) Storage (/storage/) Environment (/environment/)
Executors (/executors/)
1/9
5/12/2016 ScalaWordCount - Details for Job 0
http://54.173.130.234:4040/jobs/job/?id=0 1/2
1.6.1 (/) ScalaWordCount application UI
Details for Job 0Status: RUNNINGActive Stages: 1Pending Stages: 1
Stage 0
transferTo
sequenceFile
map
map
flatMap
Stage 1
map
e uce e
Active Stages (1)
Jobs (/jobs/) Stages (/stages/) Storage (/storage/) Environment (/environment/)
Executors (/executors/)
(a) (b)
embeddedtransformation
Implementation in Spark‣ transferTo() implicit insertion
�17
DC1 DC2
Origin Code
val InRDD = In1+In2
InRDD .filter(…) .groupByKey(…) .collect()
Produced Code
val InRDD = In1+In2
InRDD .filter(…) .transferTo(…) .groupByKey(…) .collect()
In1
filter
groupByKey
Shuffle Input
collect
In2
filter
groupByKey
Shuffle Input
DC1 DC2
In1
filter
groupByKey
Shuffle Input
collect
In2
filter
groupByKey
Shuffle Input
transferTo
Processed ByDAGScheduler
transferTo
Evaluation‣ Amazon EC2, m3.large instances
‣ 26 nodes in 6 different locations
�18
4 6
4
4
4
4N. Virginia
N. California
São Paulo
Frankfurt
Singapore
Sydney
Performance
�19
The lower, the better
Take-Away Messages‣ Push-based shuffle mechanism is beneficial in wide-area
data analytics
‣ Aggregating shuffle input to a subset of datacenters is likely to help when you have no priori knowledge
‣ Implementation in Apache Spark as a data transformation
‣ Performance: reduced shuffle time and its variance
�20
Thanks! Q&A
�21