Optimizing Shufﬂe in Wide-Area Data Analytics...Wide-Area Data Analytics 3 N. California Ireland...

Optimizing Shuffle inWide-Area Data AnalyticsShuhao Liu*, Hao Wang, Baochun Li Department of Electrical & Computer Engineering University of Toronto

What is: - Wide-Area Data Analytics? - Shuffle?

�2

Wide-Area Data Analytics

�3

N. California

Ireland

Singapore

Oregon

N. Virginiadata 1

data 2

data 3

Large volumes of data are generated, stored and processed across geographically distributed DCs.

Existing work focuses on Task Placement

�4

Rethink the root cause of inter-DC traffic: Shuffle

�5

Fetch-based Shuffle

�6

Mapper 1

Mapper 2

Mapper 3

Reducer 1

Reducer 3

Reducer 2

All-to-allCommunicationat Beginning of Reduce Tasks

Problems with Fetch‣ Under-utilize the inter-datacenter bandwidth

‣ Start late: beginning of reduce

‣ Start concurrently: share bandwidth

‣ Need for refetch

‣ Possible reduce task failure

�7

Push-based Shuffle‣ Bandwidth Utilization

�8

time0 4 8 12 16

time0 4 8 12 16

Map Reduce

Reduce

Map Data Transfer

Data Transfer

Reduce

ReduceMap

Shuffle Read

Shuffle Read

ShuffleWrite

ShuffleWrite

Map

Stage N Stage N+1

Stage N Stage N+1

worker A

worker B

worker A

worker B

ShuffleRead

(a)

(b)

Push-based Shuffle‣ Failure Recovery

�9

time0 4 8 12 16 20 24

Map FailedReduce

Reduce

Refetch ReduceShuffle Read

Shuffle Read

ShuffleWrite

time0 4 8 12 16 20 24

Map

Stage N Stage N+1

worker A

worker B(a)

Map FailedReduce

Reduce

Reduce

Map

Stage N Stage N+1

worker A

worker B(b)

Data Transfer

Data Transfer

ShuffleWrite

ShuffleRead

Refetch

Where to Push?‣ Optional: existing task placement algorithms

‣ Know reducer placement before hand

‣ Require prior knowledge

‣ e.g., predictable jobs, inter-DC available bandwidth

‣ Our solution: Push/Aggregate

�10

‣ Send shuffle input to a subset of datacenters with a large portion of shuffle input

‣ Reduce inter-datacenter traffic in future shuffles

‣ Likely to reduce inter-datacenter traffic at current shuffle

Aggregating Shuffle Input

�11

Datacenter AReceiver of shuffle

input

Reducer 1

Reducer 3

Datacenter B

Reducer 2

Reducer 4

Reducer 5

Inter-DC Transfers

For any partition of shuffle input, the expected inter-datacenter traffic in next shuffle is proportional to the number of non-colocated reducers.

�12

Aggregating Shuffle Input‣ Send shuffle input to a subset of datacenters with a

large portion of shuffle input

‣ Reducer is likely to be placed close to shuffle input

‣ More aggregated data -> less inter-datacenter traffic with reasonable task placement

�13

Implementation in Spark‣ Requirements:

‣ Push before writing to disk

‣ Destined to the aggregator datacenters

‣ transferTo() as an RDD transformation

‣ Allow implicit or explicit usage

�14

Implementation in Spark

�15

InputRDD

A1 B1A2

mapA1

mapA2

mapB1

reduceA1, A2

B1

reduceA1, A2

B1

InputRDD .map(…) .reduce(…) …

InputRDD .map(…) .transferTo([A]) .reduce(…) …

InputRDD

A1 B1A2

mapA1

mapA2

mapB1

reduceA1, A2, Ax

transferTo

A1transferTo

A2transferTo

A*

reduceA1, A2, Ax

(a) (b)

Implementation in Spark

�16

5/12/2016 ScalaWordCount - Details for Job 0

http://54.173.130.234:4040/jobs/job/?id=0 1/2

1.6.1(/) ScalaWordCount application UI

(kill)(/stages/stage/kill/?id=0&terminate=true)

Details for Job 0

Status: RUNNINGActive Stages: 1Pending Stages: 1

Stage 0

map

sequenceFile

map

flatMap

Stage 1

map

reduceByKey

Active Stages (1)

Stage

Id Description Submitted Duration

Tasks:

Succeeded/Total Input Output

Shuffle

Read

Shuffle

Write

0 2016/05/1207:01:01

8 s 128.0MB

24.1KB

Jobs (/jobs/) Stages (/stages/) Storage (/storage/) Environment (/environment/)

Executors (/executors/)

1/9

5/12/2016 ScalaWordCount - Details for Job 0

http://54.173.130.234:4040/jobs/job/?id=0 1/2

1.6.1 (/) ScalaWordCount application UI

Details for Job 0Status: RUNNINGActive Stages: 1Pending Stages: 1

Stage 0

transferTo

sequenceFile

map

map

flatMap

Stage 1

map

e uce e

Active Stages (1)

Jobs (/jobs/) Stages (/stages/) Storage (/storage/) Environment (/environment/)

Executors (/executors/)

(a) (b)

embeddedtransformation

Implementation in Spark‣ transferTo() implicit insertion

�17

DC1 DC2

Origin Code

val InRDD = In1+In2

InRDD .filter(…) .groupByKey(…) .collect()

Produced Code

val InRDD = In1+In2

InRDD .filter(…) .transferTo(…) .groupByKey(…) .collect()

In1

filter

groupByKey

Shuffle Input

collect

In2

filter

groupByKey

Shuffle Input

DC1 DC2

In1

filter

groupByKey

Shuffle Input

collect

In2

filter

groupByKey

Shuffle Input

transferTo

Processed ByDAGScheduler

transferTo

Evaluation‣ Amazon EC2, m3.large instances

‣ 26 nodes in 6 different locations

�18

4 6

4

4

4

4N. Virginia

N. California

São Paulo

Frankfurt

Singapore

Sydney

Performance

�19

The lower, the better

Take-Away Messages‣ Push-based shuffle mechanism is beneficial in wide-area

data analytics

‣ Aggregating shuffle input to a subset of datacenters is likely to help when you have no priori knowledge

‣ Implementation in Apache Spark as a data transformation

‣ Performance: reduced shuffle time and its variance

�20

Thanks! Q&A

�21

Optimizing Shufﬂe in Wide-Area Data Analytics...Wide-Area Data Analytics 3 N. California Ireland...

Documents

Transcript of Optimizing Shufﬂe in Wide-Area Data Analytics...Wide-Area Data Analytics 3 N. California Ireland...