MapReduce performance prediction
8/12/2019 MapReduce performance prediction
A Hadoop MapReduce
Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China
Background
Hadoop MapReduce
[Figure: the MapReduce data flow: input data stored in HDFS is split, processed by Map tasks into (Key, Value) pairs, partitioned (Partition 1, Partition 2), and aggregated by Reduce tasks.]
Background
Hadoop
There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.
[Figure: the steps inside a Map task: READ, Map, SORT, MERGE, OUTPUT.]
Motivation
Problems
- Scheduling: takes no account of execution time or of the different types of resources consumed
- Parameter tuning: numerous parameters, and the default values are not optimal
[Figure: CPU-intensive jobs submitted to Hadoop running the default configuration.]
Motivation
Solution
Predict the performance of Hadoop jobs, to address:
- Scheduling: takes no account of execution time or of the different types of resources consumed
- Parameter tuning: numerous parameters, and the default values are not optimal
Related Work
Existing Prediction Method 1: Black-Box Based
Job features are fed to statistical / machine-learning models, which output an execution time.
- Lacks any analysis of Hadoop itself
- The right model is hard to choose
Related Work
Existing Prediction Method 2: Cost-Model Based
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
These functions output an execution time, but it is difficult to ensure accuracy:
- Lots of concurrent processes
- Hard to divide the execution into stages
[Figure: the Hadoop pipeline: read, map, output, then read, reduce, output.]
Related Work
A Brief Summary of Existing Prediction Methods

Black Box
- Advantages: simple and effective; high accuracy when jobs are highly isomorphic
- Shortcomings: lack of job feature extraction; lack of analysis of Hadoop; hard to divide the work into individual steps and resources

Cost Model
- Advantages: detailed analysis of Hadoop processing; flexible division (by stage, by resource); multiple predictions
- Shortcomings: lack of job feature extraction; many concurrent processes that are hard to model; better for theoretical analysis, not suitable for prediction

Common to both:
o Simple prediction
o Lack of analysis of the job itself (jar package + data)
Goal
Design a Hadoop MapReduce performance prediction system to:
- Predict a job's consumption of various types of resources (CPU, disk I/O, network)
- Predict the execution time of the Map phase and the Reduce phase
[Figure: a Job enters the Prediction System, which outputs the Map execution time, the Reduce execution time, and the CPU, disk, and network occupation times.]
Design - 1
Cost Model
[Figure: a Job passes through the Cost Model, which outputs the Map execution time, the Reduce execution time, and the CPU, disk, and network occupation times.]
Cost Model [1]
Cost Function Parameter Analysis
- Type One: constants: Hadoop system consumption, initialization consumption
- Type Two: job-related parameters: computational complexity of the map function, number of map input records
- Type Three: parameters defined by the cost model: sorting coefficient, complexity factor

[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231-239.
Parameters Collection
Type One and Type Three:
- Type One: run empty map tasks and calculate the system consumption from the logs
- Type Three: extract the sort part from the Hadoop source code and sort a certain number of records

Type Two:
- Running a new job and analyzing its logs has high latency and a large overhead
- Sampling the data and analyzing only the behavior of the map and reduce functions has almost no latency and very low extra overhead: this is the Job Analyzer
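The Type Three measurement (time an extracted sort routine on a known number of records and derive a sorting coefficient) could be sketched as follows. This is an illustrative Python sketch, not the deck's code: the function name and record counts are invented, and Python's built-in sort stands in for the sort routine extracted from the Hadoop source.

```python
import math
import random
import time

def fit_sort_coefficient(sizes=(20_000, 40_000, 80_000)):
    """Time a sort on random records of several sizes and fit
    t = c * n * log(n) by least squares through the origin."""
    xs, ys = [], []
    for n in sizes:
        records = [random.random() for _ in range(n)]
        start = time.perf_counter()
        sorted(records)                      # stand-in for Hadoop's sort
        ys.append(time.perf_counter() - start)
        xs.append(n * math.log(n))
    # Slope of a line through the origin: c = sum(x*y) / sum(x*x)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

The fitted coefficient c then plays the role of the cost model's sorting coefficient for this machine.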
Job Analyzer - Implementation
Job Analyzer Implementation: a Hadoop virtual execution environment that accepts the job (jar file & input data).
- Sampling Module: samples the input data by a certain percentage (less than 5%)
- MR Module: instantiates the user's job classes using Java reflection
- Analyze Module: measures the input data (amount & number of records), the relative computational complexity, and the data conversion rate (output/input)
[Figure: Jar File + Input Data flows through the Sampling Module, the MR Module, and the Analyze Module inside the Hadoop virtual execution environment, producing the Job Feature.]
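In spirit, the Sampling and Analyze Modules do something like the sketch below. This is a minimal Python illustration rather than the deck's Java implementation: `analyze_job` is a hypothetical name, and a plain callable stands in for the map class that the MR Module would instantiate via reflection.

```python
import random

def analyze_job(records, map_fn, sample_rate=0.05):
    """Run the user's map function on a small sample of the input
    and report the job features the Analyze Module needs."""
    sample = [r for r in records if random.random() < sample_rate]
    outputs = []
    for rec in sample:
        outputs.extend(map_fn(rec))  # directly invoke the user's map code
    return {
        "input_records": len(records),
        "sampled_records": len(sample),
        # data conversion rate: output records per sampled input record
        "conversion_rate": len(outputs) / max(len(sample), 1),
    }
```

For a word-count-style map function (`lambda line: line.split()`), the conversion rate comes out as the average number of words per sampled line.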
Job Analyzer - Feasibility
Data similarity: log records have a uniform format.
Execution similarity: each record is processed by the same map & reduce functions, over and over.
[Figure: the MapReduce data flow again: every split of the input data is handled by the same Map and Reduce functions.]
Design - 2
Parameters Collection
[Figure: the Job Analyzer collects the Type Two parameters, and a Static Parameters Collection Module collects the Type One & Type Three parameters; both feed the Cost Model, which outputs the Map execution time, the Reduce execution time, and the CPU, disk, and network occupation times.]
Prediction Model
Problem analysis: there are many concurrent steps, so the total time cannot be obtained by adding up the time of each part.
[Figure: the Map-side steps (initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge sort, write disk, serialization) overlap in time across the CPU, disk, and network resources.]
Prediction Model
Main factors (according to the performance model), Map stage:
- The amount of input data (MapInput)
- The number of input records (N)
- N·log(N)
- The complexity of the map function
- The conversion rate of the map data

Tmap = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data)

[Figure: the Map-side steps (initiation, read data, network transfer, create object, map function, sort in memory, read/write disk, merge sort, write disk, serialization) annotated with the factor each one contributes.]
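Once the coefficients of the linear Map-stage model have been fitted, evaluating it is a plain dot product over the main factors. A minimal sketch, where the function name is illustrative and the coefficient tuple is a placeholder for whatever the regression step produces:

```python
import math

def predict_map_time(beta, map_input, n_records, complexity, conversion_rate):
    """Evaluate Tmap = b0 + b1*MapInput + b2*N + b3*N*log(N)
    + b4*(map complexity) + b5*(map conversion rate)."""
    b0, b1, b2, b3, b4, b5 = beta
    return (b0
            + b1 * map_input
            + b2 * n_records
            + b3 * n_records * math.log(n_records)
            + b4 * complexity
            + b5 * conversion_rate)
```

The Reduce-stage model on the backup slides evaluates the same way, with one extra term for the reduce data conversion rate.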
Find the nearest jobs!
Instance-Based Linear Regression:
- Find the samples nearest to the job to be predicted in the history logs; nearest means similar jobs (take the top K nearest, with K = 10%-15% of the samples)
- Run a linear regression over the samples found
- Calculate the predicted value

"Nearest" is measured by a weighted distance over the job features (weights w):
- High contribution to job classification: map/reduce complexity, map/reduce data conversion rate
- Low contribution to job classification: data amount, number of records
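The find-then-fit procedure can be sketched as follows, assuming a weighted Euclidean feature distance; the function name and the NumPy least-squares fit are illustrative choices, not the authors' code.

```python
import math
import numpy as np

def knn_linear_predict(history, query, weights, k_ratio=0.12):
    """Instance-based linear regression: pick the top-K nearest history
    jobs under a weighted feature distance, fit a linear model on them,
    and evaluate it at the query point.

    history: list of (feature_vector, observed_time) pairs
    query:   feature vector of the job to be predicted
    weights: per-feature weights w used in the distance
    """
    def dist(features):
        return math.sqrt(sum(w * (a - b) ** 2
                             for w, a, b in zip(weights, features, query)))

    # K as a fraction of the history, but never fewer points than unknowns.
    k = max(len(weights) + 1, int(len(history) * k_ratio))
    nearest = sorted(history, key=lambda item: dist(item[0]))[:k]

    # Ordinary least squares with an intercept term.
    X = np.array([[1.0, *f] for f, _ in nearest])
    y = np.array([t for _, t in nearest])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.array([1.0, *query]) @ beta)
```

Fitting only on the neighbors, instead of on all history, lets each prediction use a locally linear model even when the global relation between features and time is not linear.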
Prediction Module
Procedure: from the cost model, derive the main factors (Tmap = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data)); extract the job features; search for the nearest samples; build the prediction function; compute the prediction results.
[Figure: the numbered steps (1-7) of this procedure.]
Prediction Module
Procedure
[Figure: the data flow among the Training Set, the Find-Neighbor Module, the Cost Model, the Prediction Function, and the Prediction Results.]
Design - 3
Parameters Collection
[Figure: the full system: the Job Analyzer collects the Type Two parameters, the Static Parameters Collection Module collects the Type One & Type Three parameters, the Cost Model outputs the CPU, disk, and network occupation times, and the Prediction Module outputs the Map and Reduce execution times.]
Experiments
Task execution time (error rate), over 4 kinds of jobs with input sizes from 64 MB to 8 GB, comparing three settings:
- K = 12%, with a different weight w for each feature
- K = 12%, with the same weight w for every feature
- K = 25%, with a different weight w for each feature
[Figure: two charts of error rate (%) against job ID, one for Map tasks and one for Reduce tasks, comparing k=12%, k=25%, and k=12% with w=1.]
Conclusion
Job Analyzer:
- Analyzes the job jar + input file
- Collects the parameters
Prediction Module:
- Finds the main factors
- Proposes a linear equation
- Classifies the jobs
- Makes multiple predictions
Thank you!
Questions?
Cost Model [1]
Analysis of the Reduce side:
- Models the consumption of each resource (CPU, disk, network)
- Each stage involves only one type of resource
[Figure: the Reduce-side steps (initiation, read data, network transfer, create object, reduce function, merge sort, read/write disk, write disk, serialization, deserialization) mapped onto the CPU, disk, and network resources.]
Prediction Model
Main factors (according to the performance model), Reduce stage:
- The amount of input data (MapInput)
- The number of input records (N)
- N·log(N)
- The complexity of the reduce function
- The conversion rate of the map data
- The conversion rate of the reduce data

Treduce = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the reduce function) + β5·(conversion rate of the map data) + β6·(conversion rate of the reduce data)

[Figure: the Reduce-side steps annotated with the factor each one contributes.]