A Hadoop MapReduce Performance Prediction Method
Slide 1: A Hadoop MapReduce Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China
Slide 2: Background
• Hadoop MapReduce
[Diagram: input data stored in HDFS is split; each split is processed by a Map task, which emits (key, value) pairs into partitions; the partitions are shuffled to Reduce tasks, whose output is written back to HDFS.]
Slide 3: Background
• Hadoop
– Many steps within the Map stage and the Reduce stage
– Different steps may consume different types of resources
[Diagram: a Map task runs through READ, MAP, SORT, MERGE, and OUTPUT steps.]
Slide 4: Motivation
• Problems
– Scheduling: no consideration of execution time or of the different types of resources consumed
– Parameter tuning: numerous parameters, and the default values are not optimal
[Diagram: two CPU-intensive jobs are scheduled together on Hadoop; a job is submitted with the default configuration.]
Slide 5: Motivation
• Solution: predict the performance of Hadoop jobs
– For scheduling, which currently takes no account of execution time or of the different types of resources consumed
– For parameter tuning, which faces numerous parameters whose default values are not optimal
Slide 6: Related Work
• Existing Prediction Method 1: Black-Box Based
[Diagram: job features are fed into statistical/learning models, which output an execution time.]
– Lacks analysis of how Hadoop processes jobs
– The models are hard to choose
Slide 7: Related Work
• Existing Prediction Method 2: Cost-Model Based
[Diagram: job features are fed into cost functions, which output an execution time:
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
A second diagram shows the read/map-output and read/reduce-output paths overlapping inside Hadoop.]
– Difficult to ensure accuracy
– Lots of concurrent processes
– Hard to divide the stages
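To make the cost-model approach concrete: a phase cost in this style decomposes into per-stage terms, each driven by the amount of data that stage touches. The LaTeX fragment below is a minimal sketch of that idea, with hypothetical stage functions and variable names; it illustrates the style of model, not the exact formulation of any cited work:

```latex
\[
T_{\mathrm{map}} = t_{\mathrm{read}}(d) + t_{\mathrm{map}}(n) + t_{\mathrm{sort}}(n \log n)
                 + t_{\mathrm{spill}}(d') + t_{\mathrm{merge}}(d') + t_{\mathrm{write}}(d')
\]
% d  = input size in bytes, n = number of input records,
% d' = intermediate output size (d times the map data conversion rate)
```

The weakness noted on this slide is visible in the sketch: the stages actually overlap in time, so a plain sum over stages misattributes the concurrent parts.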
Slide 8: Related Work
• A Brief Summary of the Existing Prediction Methods
– Black Box
  • Advantages: simple and effective; high accuracy; high isomorphism
  • Shortcomings: lack of job feature extraction; lack of analysis of Hadoop; hard to divide each step and resource
– Cost Model
  • Advantages: detailed analysis of Hadoop processing; flexible division (by stage, by resource); multiple predictions
  • Shortcomings: lack of job feature extraction; many concurrent steps that are hard to model; better for theoretical analysis, not suitable for prediction
– In short: existing methods give simple predictions and lack analysis of the job itself (jar package + data)
Slide 9: Goal
• Design a Hadoop MapReduce performance prediction system to:
– Predict a job's consumption of various types of resources (CPU, disk I/O, network)
– Predict the execution time of the Map phase and the Reduce phase
[Diagram: a job enters the prediction system, which outputs the Map and Reduce execution times and the CPU, disk, and network occupation times.]
Slide 10: Design - 1
• Cost Model
[Diagram: a job enters the COST MODEL, which outputs the Map and Reduce execution times and the CPU, disk, and network occupation times.]
Slide 11: Cost Model [1]
• Analysis of the Map task
– Model the consumption of each resource (CPU, disk, network)
– Each stage involves only one type of resource
[Diagram: the Map stages in order: Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort in Memory, Read/Write Disk, Merge Sort, Serialization, Write Disk; each stage is attributed to CPU, disk, or network.]
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for hadoop mapreduce," in CLUSTER Workshops, 2012, pp. 231–239.
Slide 12: Cost Model [1]
• Cost Function Parameter Analysis
– Type One: constants
  • Hadoop system overhead, initialization overhead
– Type Two: job-related parameters
  • Computational complexity of the map function, number of map input records
– Type Three: parameters defined by the cost model
  • Sorting coefficient, complexity factor
Slide 13: Parameters Collection
• Type One and Type Three
– Type One: run empty map tasks and compute the system overhead from the logs
– Type Three: extract the sort code from the Hadoop source and time it sorting a certain number of records (see the sketch after this slide)
• Type Two
– Option 1: run a new job and analyze its log
  • High latency
  • Large overhead
– Option 2: sample the input data and analyze only the behavior of the map and reduce functions
  • Almost no latency
  • Very low extra overhead
→ the Job Analyzer
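As a hedged illustration of how a Type Three parameter could be measured, the sketch below times an in-memory sort for growing record counts and averages the implied cost per n·log2(n); the class name and constants are hypothetical, and the real system extracts the actual sort code from the Hadoop source rather than using Arrays.sort:

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper: estimate a sorting coefficient c such that
 *  sortTime(n) ~= c * n * log2(n), a stand-in for the "Type Three"
 *  parameter derived from Hadoop's sort code. */
public class SortCoefficient {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double sum = 0;
        int trials = 0;
        for (int n = 100_000; n <= 800_000; n *= 2) {
            long[] records = rnd.longs(n).toArray();
            long start = System.nanoTime();
            Arrays.sort(records);                  // stand-in for Hadoop's sort step
            long elapsed = System.nanoTime() - start;
            double nlogn = n * (Math.log(n) / Math.log(2));
            sum += elapsed / nlogn;                // per-comparison cost estimate
            trials++;
        }
        System.out.printf("sorting coefficient ~ %.2f ns per n*log2(n)%n", sum / trials);
    }
}
```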
Slide 14: Job Analyzer - Implementation
• Job Analyzer implementation
– Hadoop virtual execution environment
  • Accepts the job's jar file and input data
– Sampling module
  • Samples the input data by a fixed percentage (less than 5%)
– MR module
  • Instantiates the user job's classes using Java reflection (see the sketch after this slide)
– Analyze module
  • Input data (amount and number of records)
  • Relative computational complexity
  • Data conversion rate (output/input)
[Diagram: jar file + input data flow through the Sampling, MR, and Analyze modules inside the Hadoop virtual execution environment, producing the job features.]
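The MR module's use of Java reflection might look roughly like the following sketch; the class name and the surrounding workflow are assumptions for illustration, although Job.getMapperClass and ReflectionUtils.newInstance are real Hadoop APIs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;

/** Hypothetical core of the MR module: instantiate the user's mapper
 *  class via reflection so the Job Analyzer can drive it on sampled
 *  records outside a real cluster. */
public class MRModule {
    public static Mapper<?, ?, ?, ?> instantiateMapper(Job userJob) throws ClassNotFoundException {
        Configuration conf = userJob.getConfiguration();
        // getMapperClass() resolves the mapper class named in the job configuration.
        Class<? extends Mapper<?, ?, ?, ?>> mapperClass = userJob.getMapperClass();
        // ReflectionUtils calls the no-arg constructor and injects the
        // Configuration if the class is Configurable.
        return ReflectionUtils.newInstance(mapperClass, conf);
    }
}
```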
Slide 15: Job Analyzer - Feasibility
– Data similarity: logs have a uniform format
– Execution similarity: every record is processed by the same map and reduce functions repeatedly
[Diagram: the input data is split and processed by identical Map tasks, then by identical Reduce tasks.]
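Execution similarity is what makes sampling safe: since every record passes through the same map function, a small uniform sample should exhibit representative behavior. Below is a minimal sketch of percentage-based sampling, assuming line-oriented input and a hypothetical SamplingModule class (only the under-5% figure comes from the slides):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sampling module: keep each input line with probability p
 *  (the slides cap the sample at under 5% of the input). */
public class SamplingModule {
    public static List<String> sample(Path input, double p, long seed) throws IOException {
        Random rnd = new Random(seed);
        List<String> sampled = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (rnd.nextDouble() < p) {   // Bernoulli sampling, E[|sample|] = p * #records
                    sampled.add(line);
                }
            }
        }
        return sampled;
    }
}
```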
Slide 16: Design - 2
• Parameters Collection
[Diagram: the Job Analyzer collects the Type 2 parameters; the static parameter collection module collects the Type 1 and Type 3 parameters; both feed the COST MODEL, which outputs the Map and Reduce execution times and the CPU, disk, and network occupation times.]
Slide 17: Prediction Model
• Problem analysis: many steps run concurrently, so the total time cannot be obtained by adding up the times of the individual parts
[Diagram: the Map stages (Initiation, Read Data, Network Transfer, Create Object, Map Function, Sort in Memory, Read/Write Disk, Merge Sort, Serialization, Write Disk) overlap across CPU, disk, and network.]
Slide 18: Prediction Model
• Main factors (according to the performance model), Map stage:
– The amount of input data (MapInput)
– The number of input records (N)
– N·log(N)
– The complexity of the map function
– The conversion rate of the map data

Tmap = α0 + α1·MapInput + α2·N + α3·N·log(N) + α4·(complexity of the map function) + α5·(conversion rate of the map data)
Slide 19: Prediction Model
• Experimental analysis
– Test 4 kinds of jobs (0–10000 records)
– Extract the features for a linear regression
– Calculate the coefficient of determination (R²)

  Jobs:  Dedup   WordCount  Project  Grep    Total
  R²:    0.9982  0.9992     0.9991   0.9949  0.6157
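For reference, R² here is presumably the usual coefficient of determination of the least-squares fit:

```latex
\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]
% y_i = measured Map time, \hat{y}_i = fitted value, \bar{y} = mean of the y_i
```

The near-1 values within each job type, against 0.62 when all jobs are pooled, are what motivate classifying jobs before regressing (next slides).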
Slide 20: Prediction Model
[Plot: Map execution time versus number of records (0 to 9000) for Dedup, Grep, Project, and WordCount.]
– Very good linear relationship within the same kind of job.
– But no linear relationship across different kinds of jobs.
Slide 21: Find the nearest jobs!
• Instance-Based Linear Regression (a sketch follows this slide)
– Find the samples nearest to the job to be predicted in the history logs; "nearest" means similar jobs (top K nearest, with K = 10%–15%)
– Run a linear regression over the samples found
– Compute the predicted value
• "Nearest":
– The weighted distance between job features (weights w)
– High contribution to job classification: map/reduce complexity, map/reduce data conversion rate
– Low contribution to job classification: data amount, number of records
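The following Java sketch shows the shape of this instance-based procedure: rank the history by weighted feature distance, keep the top K (about 12%), and fit ordinary least squares over the neighbors. All names are hypothetical, and for brevity it regresses on a single factor (N·log N) rather than the full Tmap model:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of instance-based linear regression over job history. */
public class InstanceBasedPredictor {

    /** A historical job: its feature vector and its measured Map time. */
    record Sample(double[] features, double mapTimeMs) {}

    // Feature weights: high for complexity/conversion rates, low for data size/record count.
    private final double[] w;
    private final List<Sample> history;   // assumed to hold enough samples

    InstanceBasedPredictor(double[] w, List<Sample> history) {
        this.w = w;
        this.history = history;
    }

    private double weightedDistance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += w[i] * diff * diff;
        }
        return Math.sqrt(d);
    }

    /** Predict Map time for a job, using the K nearest neighbors (K ~ 12% of history). */
    double predict(double[] jobFeatures, double nLogN) {
        int k = Math.max(2, (int) (0.12 * history.size()));
        List<Sample> neighbors = new ArrayList<>(history);
        neighbors.sort(Comparator.comparingDouble(s -> weightedDistance(s.features(), jobFeatures)));
        neighbors = neighbors.subList(0, k);

        // Ordinary least squares on (x = N*log N, y = Map time) over the neighbors.
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (Sample s : neighbors) {
            double x = s.features()[0];   // assume feature 0 stores N*log N
            sx += x; sy += s.mapTimeMs(); sxx += x * x; sxy += x * s.mapTimeMs();
        }
        double slope = (k * sxy - sx * sy) / (k * sxx - sx * sx);
        double intercept = (sy - slope * sx) / k;
        return intercept + slope * nLogN;
    }
}
```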
Slide 22: Prediction Module
• Procedure
[Diagram: the cost model yields the main factors (Tmap = α0 + α1·MapInput + α2·N + α3·N·log(N) + α4·(complexity of the map function) + α5·(conversion rate of the map data)); from the job features, the nearest samples are searched for, a prediction function is fitted, and the prediction results are computed (steps 1–7).]
Slide 23: Prediction Module
• Procedure
[Diagram: the training set and the cost model feed the Find-Neighbor module, which produces the prediction function and then the prediction results.]
Slide 24: Design - 3
• Prediction Module
[Diagram: as in Design - 2, the Job Analyzer collects the Type 2 parameters and the static parameter collection module collects the Type 1 and Type 3 parameters for the COST MODEL; a Prediction Module now produces the Map and Reduce execution times alongside the CPU, disk, and network occupation times.]
Slide 25: Experiments
• Task execution time (error rate)
– K=12%, with a different weight w for each feature
– K=12%, with the same w for every feature
– K=25%, with a different weight w for each feature
– 4 kinds of jobs, input sizes from 64 MB to 8 GB
[Plots: error rate (%) versus job ID (1–40) for Map tasks and for Reduce tasks, comparing k=12%, k=25%, and k=12% with w=1.]
Slide 26: Conclusion
• Job Analyzer:
– Analyzes the job jar + input file
– Collects the parameters
• Prediction Module:
– Finds the main factors
– Proposes a linear equation
– Classifies jobs
– Makes multiple predictions
Slide 27: Thank you!
Questions?
Slide 28: Cost Model [1]
• Analysis of the Reduce task
– Model the consumption of each resource (CPU, disk, network)
– Each stage involves only one type of resource
[Diagram: the Reduce stages (Initiation, Read Data / Network Transfer, Create Object, Deserialization, Merge Sort, Read/Write Disk, Reduce Function, Serialization, Write Disk); each stage is attributed to CPU, disk, or network.]
Slide 29: Prediction Model
• Main factors (according to the performance model), Reduce stage:
– The amount of input data (MapInput)
– The number of input records (N)
– N·log(N)
– The complexity of the reduce function
– The conversion rate of the map data
– The conversion rate of the reduce data

Treduce = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the reduce function) + β5·(conversion rate of the map data) + β6·(conversion rate of the reduce data)