LIBRA: Lightweight Data Skew Mitigation in MapReduce
Qi Chen, Jinyu Yao, and Zhen Xiao
Nov 2014. To appear in IEEE Transactions on Parallel and Distributed Systems
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Introduction
The new era of Big Data is coming!
– 20 PB per day (2008)
– 30 TB per day (2009)
– 60 TB per day (2010)
– petabytes per day
What does big data mean?
– Important user information
– Significant business value

MapReduce
What is MapReduce? The most popular parallel computing model, proposed by Google.
Applications:
– Database operations: Select, Join, Group
– Search engines: PageRank, inverted index, log analysis
– Machine learning: clustering, machine translation, recommendation
– Cryptanalysis
– Scientific computation
– …
Data skew in MapReduce
The imbalance in the amount of data assigned to each task.
– Fundamental reason: datasets in the real world are often skewed (physical properties, hot spots)
– We do not know the data distribution beforehand
– It cannot be solved by speculative execution
– Mantri observed that the coefficient of variation in data size across tasks is 0.34 at the 50th percentile and 3.1 at the 90th percentile in a Microsoft production cluster
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Architecture
[Figure: MapReduce architecture. The master assigns input splits (Split 1 … Split M) to map tasks; each map task partitions its output (Part 1, Part 2, …); each reduce task copies its partition from every map output and writes an output file.]
Data flow:
map: [(K1, V1)] → [(K2, V2)]
combine: [(K2, V2)] → [(K2, [V2])]
copy/sort: [(K2, [V2])] → [(K2, [V2])]
reduce: [(K2, [V2])] → [(K3, V3)]
Intermediate data are divided according to a user-defined partitioner (a sketch of this data flow follows below).
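To make the data flow concrete, here is a minimal, self-contained Python simulation of the map → combine → partition → sort → reduce pipeline. This is not the Hadoop API; all names (run_job, map_fn, and so on) are illustrative only.

```python
from collections import defaultdict

def map_fn(record):
    # Word-count map: (K1, V1) -> [(K2, V2)].
    for word in record.split():
        yield (word, 1)

def combine_fn(key, values):
    # Map-side local aggregation: (K2, [V2]) -> (K2, V2').
    return key, sum(values)

def reduce_fn(key, values):
    # (K2, [V2]) -> (K3, V3).
    return key, sum(values)

def partitioner(key, num_reducers):
    # Default hash partitioning; skew arises when one partition
    # receives far more tuples than the others.
    return hash(key) % num_reducers

def run_job(records, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:               # map stage
        local = defaultdict(list)
        for k, v in map_fn(record):
            local[k].append(v)
        for k, vs in local.items():      # combine, then partition
            k2, v2 = combine_fn(k, vs)
            partitions[partitioner(k2, num_reducers)][k2].append(v2)
    results = []
    for part in partitions:              # reduce stage: sort, then reduce
        for k in sorted(part):
            results.append(reduce_fn(k, part[k]))
    return results

print(run_job(["a b a", "b a c"]))       # e.g. [('a', 3), ('c', 1), ('b', 2)]
```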
Challenges in solving data skew
– Many real-world applications exhibit data skew: Sort, Grep, Join, Group, Aggregation, PageRank, Inverted Index, etc.
– The data distribution cannot be determined ahead of time
– The computing environment can be heterogeneous: diversity of hardware; resource competition in cloud environments
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Previous work
In the parallel database area:
– limited to join, group, and aggregate operations
Pre-run sampling jobs:
– adding two pre-run sampling and counting jobs for theta-join (SIGMOD'11)
– running pre-processing extraction and sampling procedures for spatial feature extraction (SOCC'11)
Collecting data information during job execution:
– collecting key frequencies on each node and aggregating them on the master after all maps are done (CloudCom'10)
– partitioning intermediate data into more partitions and using greedy bin-packing to pack them after all maps finish (CLOSER'11, ICDE'12)
SkewTune (SIGMOD'12):
– splits skewed tasks when detected
– reconstructs the output by concatenating the results
Drawbacks of these approaches:
– applicable only to certain applications
– significant overhead
– bring a barrier between the map and reduce phases
– bin-packing cannot support total order
– need more task slots
– cannot detect large keys
– cannot split in the copy and sort phases
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
LIBRA – Solving data skew
[Figure: LIBRA architecture. (1) Sample map tasks are issued first; (2) they send their sample data to the master; (3) the master calculates the partitions; (4) the master asks the workers to partition the map output accordingly. Normal map tasks and reduce tasks then run as usual, reading input from and writing output to HDFS.]
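The four-step flow above can be simulated in a few lines. This is a schematic sketch only: Master and its methods are hypothetical names, and the boundary choice here is simple equal-frequency splitting rather than LIBRA's weighted algorithm.

```python
class Master:
    """A toy simulation of LIBRA's sample-first control flow."""
    def __init__(self, num_reducers):
        self.num_reducers = num_reducers
        self.samples = []

    def collect(self, sample):            # step 2: receive sample data
        self.samples.extend(sample)

    def calculate_partitions(self):       # step 3: pick range boundaries
        keys = sorted(k for k, _ in self.samples)
        step = max(1, len(keys) // self.num_reducers)
        return keys[step::step][: self.num_reducers - 1]

splits = [["b", "a", "a"], ["c", "a", "b"]]
master = Master(num_reducers=2)
for split in splits:                      # step 1: sample tasks run first
    master.collect([(k, 1) for k in split])
print(master.calculate_partitions())      # step 4: workers partition by these boundaries
```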
Sampling and partitioning
Sampling strategies:
– Random, TopCluster (ICDE'12)
– LIBRA: p largest keys and q random keys
Estimating the intermediate data distribution (a sketch follows below):
– each large key represents only that one large key
– each random key represents a small range of keys
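A minimal sketch of the per-split sampling strategy described above; p and q follow the slide, while sample_split and the toy data are mine.

```python
import random
from collections import Counter

def sample_split(keys, p=20, q=1000):
    """Collect the p largest keys and q random keys from one split's
    intermediate keys, as in LIBRA's sampling strategy."""
    freq = Counter(keys)
    # p largest keys: each represents exactly one (potentially huge) key.
    largest = dict(freq.most_common(p))
    # q random keys: each represents a small range of nearby keys.
    rest = [k for k in freq if k not in largest]
    randoms = {k: freq[k] for k in random.sample(rest, min(q, len(rest)))}
    return largest, randoms

largest, randoms = sample_split(list("aaaaabbbcdefg"), p=2, q=3)
print(largest)   # {'a': 5, 'b': 3}
print(randoms)   # 3 random keys with their counts
```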
Partitioning strategies:
– Hash, bin packing, range
– LIBRA: range
Heterogeneity consideration:
[Figure: 300 intermediate tuples are partitioned across three nodes in proportion to their performance factors: Node1 (performance = 1.5) receives 150 tuples, Node2 (performance = 1.0) receives 100, and Node3 (performance = 0.5) receives 50, so Reducer1, Reducer2, and Reducer3 start and finish at about the same time.]
Problem Statement
The intermediate data can be represented as:
(K1, C1), (K2, C2), …, (Kn, Cn), with Ki < Ki+1
– Ki: a distinct key
– Ci: the number of (k, v) pairs with key Ki
Range partition: choose boundaries $0 = \xi_0 < \xi_1 < \dots < \xi_r = n$
Reducer $i$ processes the keys in the range $(\xi_{i-1}, \xi_i]$
Our goal: minimize the maximum weighted load across reducers,
$$\min \; \max_{1 \le i \le r} \frac{1}{f_i} \sum_{j=\xi_{i-1}+1}^{\xi_i} w(C_j)$$
where $w(C_j)$ is the computational complexity of processing $K_j$ (e.g., sort: $w(C_j) \propto C_j$; self-join: $w(C_j) \propto C_j^2$) and $f_i$ is the performance factor of the worker node running reducer $i$.
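To illustrate the objective, here is a rough greedy sketch that cuts range boundaries so each reducer's weighted load matches its performance share. It is an illustration only, not the paper's actual partitioning algorithm (which solves the min-max problem more carefully); weight defaults to the linear sort cost model.

```python
def best_partition(counts, perf, weight=lambda c: c):
    """Greedy sketch: sweep the sorted (key, count) list and cut a new
    range whenever the current reducer reaches its share of the total
    weighted load, proportional to its performance factor."""
    total = sum(weight(c) for c in counts)
    share = total / sum(perf)            # load per unit of performance
    boundaries, load, r = [], 0.0, 0
    for j, c in enumerate(counts):
        load += weight(c)
        if r < len(perf) - 1 and load >= share * perf[r]:
            boundaries.append(j + 1)     # xi_r: range r ends after key j+1
            load, r = 0.0, r + 1
    return boundaries                    # key indices where each range ends

# 300 tuples over skewed keys, three nodes with factors 1.5 : 1 : 0.5
counts = [100, 80, 40, 30, 20, 15, 10, 5]
print(best_partition(counts, perf=[1.5, 1.0, 0.5]))  # [2, 6]
```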
Distribution estimation
[Figure: three-step estimation. (a) Sum up the samples collected from all sample tasks into a sorted list L of (key, count) pairs; (b) pick the "marked keys" K1 < K2 < … < K|L| from L; (c) estimate the distribution: the interval ending at marked key Ki is estimated to contain Pi keys and Qi tuples, with Pi = 1 when Ki is a large key.]
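Step (c) can be sketched as follows, under the simplifying assumption (mine, not necessarily the paper's exact estimator) that one random sample key stands in for about 1/sample_rate unseen keys.

```python
def estimate_distribution(marked, large_keys, sample_rate):
    """marked: sorted (key, sampled_count) pairs, i.e. the marked keys.
    Returns one (interval_end_key, est_keys P_i, est_tuples Q_i) per interval."""
    intervals = []
    for key, cnt in marked:
        if key in large_keys:
            # A large key represents only itself: P_i = 1.
            intervals.append((key, 1, cnt / sample_rate))
        else:
            # A random key stands in for a small range of unseen keys.
            intervals.append((key, 1 / sample_rate, cnt / sample_rate))
    return intervals

print(estimate_distribution([("apple", 50), ("kiwi", 2)],
                            large_keys={"apple"}, sample_rate=0.2))
# [('apple', 1, 250.0), ('kiwi', 5.0, 10.0)]
```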
Sparse Index to Speed Up Partitioning
[Figure: the sorted intermediate data is stored as a sequence of chunks; a sparse index keeps one entry (Kbi, Offseti, Li, Checksumi) per chunk, where Kbi is the first key of chunk i, Offseti its byte offset, and Li its length. Locating a partition boundary then requires scanning only one chunk instead of the whole file.]
This decreases the partition time by an order of magnitude.
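A compact sketch of how such an index can be queried, assuming entries sorted by each chunk's first key (checksum verification omitted; SparseIndex is an illustrative name, not LIBRA's class).

```python
import bisect

class SparseIndex:
    """One (first_key, offset, length) entry per chunk of the sorted
    intermediate file; a boundary key is located by scanning only the
    single chunk that can contain it."""
    def __init__(self, entries):
        self.entries = sorted(entries)           # (first_key, offset, length)
        self.first_keys = [e[0] for e in self.entries]

    def chunk_for(self, boundary_key):
        # Rightmost chunk whose first key is <= boundary_key.
        i = bisect.bisect_right(self.first_keys, boundary_key) - 1
        return self.entries[max(i, 0)]

idx = SparseIndex([("apple", 0, 4096), ("mango", 4096, 4096), ("peach", 8192, 2048)])
print(idx.chunk_for("nectarine"))  # ('mango', 4096, 4096)
```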
Large Cluster Splitting
Applications that treat each intermediate (k, v) pair independently in the reduce phase (e.g., sort, grep, join) allow a large key's cluster to be split, as sketched below.
Example: keys A (cnt = 100), B (cnt = 10), C (cnt = 10)
– Cluster split not allowed: Reducer 1 gets A (cnt = 100); Reducer 2 gets B and C (cnt = 20 total) → data skewed
– Cluster split allowed: Reducer 1 gets A (cnt = 60); Reducer 2 gets the rest of A (cnt = 40) plus B and C (cnt = 60 total) → balanced
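A toy sketch of the split: assign (key, count) clusters to fixed-capacity reducers, splitting a large key's tuples when they do not fit. The capacity model and helper name are mine; LIBRA balances weighted load rather than a fixed tuple count.

```python
def split_cluster(clusters, capacity):
    """Assign (key, count) clusters to reducers of fixed capacity,
    splitting a large key's tuples across reducers when needed."""
    reducers, current, free = [], [], capacity
    for key, cnt in clusters:
        while cnt > 0:
            take = min(cnt, free)
            current.append((key, take))
            cnt, free = cnt - take, free - take
            if free == 0:                 # reducer full: start a new one
                reducers.append(current)
                current, free = [], capacity
    if current:
        reducers.append(current)
    return reducers

print(split_cluster([("A", 100), ("B", 10), ("C", 10)], capacity=60))
# [[('A', 60)], [('A', 40), ('B', 10), ('C', 10)]]
```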
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Experiment Environment
Cluster: 30 virtual machines on 15 physical machines
Each physical machine: dual processors (2.4GHz Xeon E5620), 24GB of RAM, two 150GB disks, connected by 1Gbps Ethernet
Each virtual machine: 2 virtual cores, 4GB of RAM, and 40GB of disk space
Benchmarks: Sort, Grep, Inverted Index, Join
Evaluation – Accuracy of the Sampling Method
Zipf distribution (parameter = 1.0), #keys = 65535
Sample 20% of the splits and 1000 keys from each split
Evaluation – LIBRA Execution (sort)
80% faster than Hadoop with hash partitioning
167% faster than Hadoop with range partitioning
Evaluation – Degree of the skew (sort)
The overhead of LIBRA is minimal
Evaluation – Different Applications
– Grep: search for different words in the full English Wikipedia archive, with a total data size of 31GB
– Inverted Index: dataset is the full English Wikipedia archive
– Join
Evaluation – Heterogeneous Environments (sort)
30% faster than LIBRA without the heterogeneity consideration
Outlines
1. Introduction
2. Background
3. Previous work
4. System Design
5. Evaluation
6. Conclusion
Conclusion
We present LIBRA, a system that implements a set of innovative skew mitigation strategies in MapReduce:
– A new sampling method for general user-defined programs: p largest keys and q random keys
– An approach to balance the load among the reduce tasks, including support for splitting large keys
– An innovative consideration of heterogeneous environments: balance the processing time instead of just the amount of data
Performance evaluation demonstrates that:
– the improvement is significant (up to 4x faster)
– the overhead is minimal and negligible even in the absence of skew
Thank You!