LIBRA: Lightweight Data Skew Mitigation in MapReduce
Qi Chen, Jinyu Yao, and Zhen Xiao
Nov 2014. To appear in IEEE Transactions on Parallel and Distributed Systems
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Introduction
The new era of Big Data is coming!
– 20 PB per day (2008)
– 30 TB per day (2009)
– 60 TB per day (2010)
– petabytes per day
What does big data mean?
– Important user information
– Significant business value

MapReduce
What is MapReduce? The most popular parallel computing model, proposed by Google.
Applications:
– Database operations: Select, Join, Group
– Search engines: PageRank, inverted index, log analysis
– Machine learning: clustering, machine translation, recommendation
– Cryptanalysis
– Scientific computation
– …
Data skew in MapReduce
The imbalance in the amount of data assigned to each task.
– Fundamental reason: datasets in the real world are often skewed (physical properties, hot spots)
– We do not know the data distribution beforehand
– It cannot be solved by speculative execution
– Mantri observed that the coefficient of variation in data size across tasks is 0.34 at the 50th percentile and 3.1 at the 90th percentile in a Microsoft production cluster
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Architecture
[Figure: MapReduce architecture. The master assigns input splits (Split 1 … Split M) to map tasks; each map task partitions its output (Part 1, Part 2, …); each reduce task copies its partition from every map output and writes an output file.]
Data flow:
map: [(K1, V1)] → [(K2, V2)]
combine: [(K2, V2)] → [(K2, [V2])]
copy/sort: [(K2, [V2])] → [(K2, [V2])]
reduce: [(K2, [V2])] → [(K3, V3)]
Intermediate data are divided according to a user-defined partitioner (a sketch of this data flow follows below).
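To make the data flow concrete, here is a minimal, self-contained Python simulation of the map → combine → partition → sort → reduce pipeline. This is not the Hadoop API; all names (run_job, map_fn, and so on) are illustrative only.

```python
from collections import defaultdict

def map_fn(record):
    # Word-count map: (K1, V1) -> [(K2, V2)].
    for word in record.split():
        yield (word, 1)

def combine_fn(key, values):
    # Map-side local aggregation: (K2, [V2]) -> (K2, V2').
    return key, sum(values)

def reduce_fn(key, values):
    # (K2, [V2]) -> (K3, V3).
    return key, sum(values)

def partitioner(key, num_reducers):
    # Default hash partitioning; skew arises when one partition
    # receives far more tuples than the others.
    return hash(key) % num_reducers

def run_job(records, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:               # map stage
        local = defaultdict(list)
        for k, v in map_fn(record):
            local[k].append(v)
        for k, vs in local.items():      # combine, then partition
            k2, v2 = combine_fn(k, vs)
            partitions[partitioner(k2, num_reducers)][k2].append(v2)
    results = []
    for part in partitions:              # reduce stage: sort, then reduce
        for k in sorted(part):
            results.append(reduce_fn(k, part[k]))
    return results

print(run_job(["a b a", "b a c"]))       # e.g. [('a', 3), ('c', 1), ('b', 2)]
```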
Challenges in solving data skew
– Many real-world applications exhibit data skew: Sort, Grep, Join, Group, Aggregation, PageRank, Inverted Index, etc.
– The data distribution cannot be determined ahead of time
– The computing environment can be heterogeneous: diversity of hardware; resource competition in cloud environments
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Previous work
In the parallel database area:
– limited to join, group, and aggregate operations
Pre-run sampling jobs:
– adding two pre-run sampling and counting jobs for theta-join (SIGMOD'11)
– running pre-processing extraction and sampling procedures for spatial feature extraction (SOCC'11)
Collecting data information during job execution:
– collecting key frequencies on each node and aggregating them on the master after all maps are done (CloudCom'10)
– partitioning intermediate data into more partitions and using greedy bin-packing to pack them after all maps finish (CLOSER'11, ICDE'12)
SkewTune (SIGMOD'12):
– splits skewed tasks when detected
– reconstructs the output by concatenating the results
Drawbacks of these approaches:
– applicable only to certain applications
– significant overhead
– bring a barrier between the map and reduce phases
– bin-packing cannot support total order
– need more task slots
– cannot detect large keys
– cannot split in the copy and sort phases
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
LIBRA – Solving data skew
[Figure: LIBRA architecture. (1) Sample map tasks are issued first; (2) they send their sample data to the master; (3) the master calculates the partitions; (4) the master asks the workers to partition the map output accordingly. Normal map tasks and reduce tasks then run as usual, reading input from and writing output to HDFS.]
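The four-step flow above can be simulated in a few lines. This is a schematic sketch only: Master and its methods are hypothetical names, and the boundary choice here is simple equal-frequency splitting rather than LIBRA's weighted algorithm.

```python
class Master:
    """A toy simulation of LIBRA's sample-first control flow."""
    def __init__(self, num_reducers):
        self.num_reducers = num_reducers
        self.samples = []

    def collect(self, sample):            # step 2: receive sample data
        self.samples.extend(sample)

    def calculate_partitions(self):       # step 3: pick range boundaries
        keys = sorted(k for k, _ in self.samples)
        step = max(1, len(keys) // self.num_reducers)
        return keys[step::step][: self.num_reducers - 1]

splits = [["b", "a", "a"], ["c", "a", "b"]]
master = Master(num_reducers=2)
for split in splits:                      # step 1: sample tasks run first
    master.collect([(k, 1) for k in split])
print(master.calculate_partitions())      # step 4: workers partition by these boundaries
```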
Sampling and partitioning
Sampling strategies:
– Random, TopCluster (ICDE'12)
– LIBRA: p largest keys and q random keys
Estimating the intermediate data distribution (a sketch follows below):
– each large key represents only that one large key
– each random key represents a small range of keys
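A minimal sketch of the per-split sampling strategy described above; p and q follow the slide, while sample_split and the toy data are mine.

```python
import random
from collections import Counter

def sample_split(keys, p=20, q=1000):
    """Collect the p largest keys and q random keys from one split's
    intermediate keys, as in LIBRA's sampling strategy."""
    freq = Counter(keys)
    # p largest keys: each represents exactly one (potentially huge) key.
    largest = dict(freq.most_common(p))
    # q random keys: each represents a small range of nearby keys.
    rest = [k for k in freq if k not in largest]
    randoms = {k: freq[k] for k in random.sample(rest, min(q, len(rest)))}
    return largest, randoms

largest, randoms = sample_split(list("aaaaabbbcdefg"), p=2, q=3)
print(largest)   # {'a': 5, 'b': 3}
print(randoms)   # 3 random keys with their counts
```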
Partitioning strategies:
– Hash, bin packing, range
– LIBRA: range
Heterogeneity consideration:
[Figure: 300 intermediate tuples are partitioned across three nodes in proportion to their performance factors: Node1 (performance = 1.5) receives 150 tuples, Node2 (performance = 1.0) receives 100, and Node3 (performance = 0.5) receives 50, so Reducer1, Reducer2, and Reducer3 start and finish at about the same time.]
Problem Statement
The intermediate data can be represented as:
(K1, C1), (K2, C2), …, (Kn, Cn), with Ki < Ki+1
– Ki: a distinct key
– Ci: the number of (k, v) pairs with key Ki
Range partition: choose boundaries $0 = \xi_0 < \xi_1 < \dots < \xi_r = n$
Reducer $i$ processes the keys in the range $(\xi_{i-1}, \xi_i]$
Our goal: minimize the maximum weighted load across reducers,
$$\min \; \max_{1 \le i \le r} \frac{1}{f_i} \sum_{j=\xi_{i-1}+1}^{\xi_i} w(C_j)$$
where $w(C_j)$ is the computational complexity of processing $K_j$ (e.g., sort: $w(C_j) \propto C_j$; self-join: $w(C_j) \propto C_j^2$) and $f_i$ is the performance factor of the worker node running reducer $i$.
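To illustrate the objective, here is a rough greedy sketch that cuts range boundaries so each reducer's weighted load matches its performance share. It is an illustration only, not the paper's actual partitioning algorithm (which solves the min-max problem more carefully); weight defaults to the linear sort cost model.

```python
def best_partition(counts, perf, weight=lambda c: c):
    """Greedy sketch: sweep the sorted (key, count) list and cut a new
    range whenever the current reducer reaches its share of the total
    weighted load, proportional to its performance factor."""
    total = sum(weight(c) for c in counts)
    share = total / sum(perf)            # load per unit of performance
    boundaries, load, r = [], 0.0, 0
    for j, c in enumerate(counts):
        load += weight(c)
        if r < len(perf) - 1 and load >= share * perf[r]:
            boundaries.append(j + 1)     # xi_r: range r ends after key j+1
            load, r = 0.0, r + 1
    return boundaries                    # key indices where each range ends

# 300 tuples over skewed keys, three nodes with factors 1.5 : 1 : 0.5
counts = [100, 80, 40, 30, 20, 15, 10, 5]
print(best_partition(counts, perf=[1.5, 1.0, 0.5]))  # [2, 6]
```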
Distribution estimation
[Figure: three-step estimation. (a) Sum up the samples collected from all sample tasks into a sorted list L of (key, count) pairs; (b) pick the "marked keys" K1 < K2 < … < K|L| from L; (c) estimate the distribution: the interval ending at marked key Ki is estimated to contain Pi keys and Qi tuples, with Pi = 1 when Ki is a large key.]
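Step (c) can be sketched as follows, under the simplifying assumption (mine, not necessarily the paper's exact estimator) that one random sample key stands in for about 1/sample_rate unseen keys.

```python
def estimate_distribution(marked, large_keys, sample_rate):
    """marked: sorted (key, sampled_count) pairs, i.e. the marked keys.
    Returns one (interval_end_key, est_keys P_i, est_tuples Q_i) per interval."""
    intervals = []
    for key, cnt in marked:
        if key in large_keys:
            # A large key represents only itself: P_i = 1.
            intervals.append((key, 1, cnt / sample_rate))
        else:
            # A random key stands in for a small range of unseen keys.
            intervals.append((key, 1 / sample_rate, cnt / sample_rate))
    return intervals

print(estimate_distribution([("apple", 50), ("kiwi", 2)],
                            large_keys={"apple"}, sample_rate=0.2))
# [('apple', 1, 250.0), ('kiwi', 5.0, 10.0)]
```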
Sparse Index to Speed Up Partitioning
[Figure: the sorted intermediate data is stored as a sequence of chunks; a sparse index keeps one entry (Kbi, Offseti, Li, Checksumi) per chunk, where Kbi is the first key of chunk i, Offseti its byte offset, and Li its length. Locating a partition boundary then requires scanning only one chunk instead of the whole file.]
This decreases the partition time by an order of magnitude.
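A compact sketch of how such an index can be queried, assuming entries sorted by each chunk's first key (checksum verification omitted; SparseIndex is an illustrative name, not LIBRA's class).

```python
import bisect

class SparseIndex:
    """One (first_key, offset, length) entry per chunk of the sorted
    intermediate file; a boundary key is located by scanning only the
    single chunk that can contain it."""
    def __init__(self, entries):
        self.entries = sorted(entries)           # (first_key, offset, length)
        self.first_keys = [e[0] for e in self.entries]

    def chunk_for(self, boundary_key):
        # Rightmost chunk whose first key is <= boundary_key.
        i = bisect.bisect_right(self.first_keys, boundary_key) - 1
        return self.entries[max(i, 0)]

idx = SparseIndex([("apple", 0, 4096), ("mango", 4096, 4096), ("peach", 8192, 2048)])
print(idx.chunk_for("nectarine"))  # ('mango', 4096, 4096)
```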
Large Cluster Splitting
Applications that treat each intermediate (k, v) pair independently in the reduce phase (e.g., sort, grep, join) allow a large key's cluster to be split, as sketched below.
Example: keys A (cnt = 100), B (cnt = 10), C (cnt = 10)
– Cluster split not allowed: Reducer 1 gets A (cnt = 100); Reducer 2 gets B and C (cnt = 20 total) → data skewed
– Cluster split allowed: Reducer 1 gets A (cnt = 60); Reducer 2 gets the rest of A (cnt = 40) plus B and C (cnt = 60 total) → balanced
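A toy sketch of the split: assign (key, count) clusters to fixed-capacity reducers, splitting a large key's tuples when they do not fit. The capacity model and helper name are mine; LIBRA balances weighted load rather than a fixed tuple count.

```python
def split_cluster(clusters, capacity):
    """Assign (key, count) clusters to reducers of fixed capacity,
    splitting a large key's tuples across reducers when needed."""
    reducers, current, free = [], [], capacity
    for key, cnt in clusters:
        while cnt > 0:
            take = min(cnt, free)
            current.append((key, take))
            cnt, free = cnt - take, free - take
            if free == 0:                 # reducer full: start a new one
                reducers.append(current)
                current, free = [], capacity
    if current:
        reducers.append(current)
    return reducers

print(split_cluster([("A", 100), ("B", 10), ("C", 10)], capacity=60))
# [[('A', 60)], [('A', 40), ('B', 10), ('C', 10)]]
```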
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Experiment Environment
Cluster: 30 virtual machines on 15 physical machines
Each physical machine: dual processors (2.4GHz Xeon E5620), 24GB of RAM, two 150GB disks, connected by 1Gbps Ethernet
Each virtual machine: 2 virtual cores, 4GB of RAM, and 40GB of disk space
Benchmarks: Sort, Grep, Inverted Index, Join
Evaluation – Accuracy of the Sampling Method
Zipf distribution (parameter = 1.0), #keys = 65535
Sample 20% of the splits and 1000 keys from each split
Evaluation – LIBRA Execution (sort)
80% faster than Hadoop with hash partitioning
167% faster than Hadoop with range partitioning
Evaluation – Degree of the skew (sort)
The overhead of LIBRA is minimal
Evaluation – Different Applications
– Grep: search for different words in the full English Wikipedia archive, with a total data size of 31GB
– Inverted Index: dataset is the full English Wikipedia archive
– Join
Evaluation – Heterogeneous Environments (sort)
30% faster than LIBRA without the heterogeneity consideration
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Conclusion
We present LIBRA, a system that implements a set of innovative skew mitigation strategies in MapReduce:
– A new sampling method for general user-defined programs: p largest keys and q random keys
– An approach to balance the load among the reduce tasks, including support for splitting large keys
– An innovative consideration of heterogeneous environments: balance the processing time instead of just the amount of data
Performance evaluation demonstrates that:
– the improvement is significant (up to 4x faster)
– the overhead is minimal and negligible even in the absence of skew
Thank You!