Olivier Beaumont, Lionel Eyraud-Dubois and Juan Angel Lorenzo del Castillo October 24, 2014
Analyzing Real Cluster Data for Formulating Allocation Algorithms in Cloud Platforms
Inria Bordeaux - Sud-Ouest
Need for Models in Computer Science
Lack of real usage data from Cloud infrastructures (privacy, lock-in, scale...)
• Google recently (Nov. 2011) released a trace of usage data from one of its (huge) clusters.
• A few other traces are available, but none of them is as detailed.
Olivier Beaumont et al. - Analyzing Cluster Data for Allocation Algorithms in Clouds. October 24, 2014 - 2
The Problem
Resource allocation algorithms for Cloud Computing, or: how to allocate a set of services or virtual machines (VMs) on a set of physical machines (PMs)
• Objective: optimize resource usage, maximize QoS, ensure SLAs, limit the number of migrations...
• Diverse approaches: online/offline Bin-Packing algorithms (First Fit, Best Fit). The underlying problem is known to be notoriously difficult (Bin-Packing is NP-hard).
• Most relevant aspects? Dynamicity, fault tolerance, multidimensional resources, additional user-supplied constraints, ...
No consensus on the algorithmic models
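As a point of reference for the Bin-Packing heuristics named on this slide, here is a minimal sketch of First Fit and Best Fit on a single resource dimension. The integer demands, the capacity of 10 and the function names are illustrative assumptions, not values from the trace; real allocators would also have to handle several dimensions and arrivals over time.

```python
from typing import List, Optional


def first_fit(demands: List[int], capacity: int) -> List[List[int]]:
    """Place each demand on the first machine that still has room.

    Assumes every single demand fits on an empty machine.
    """
    machines: List[List[int]] = []
    for demand in demands:
        for machine in machines:
            if sum(machine) + demand <= capacity:
                machine.append(demand)
                break
        else:
            # No existing machine can host this demand: open a new one.
            machines.append([demand])
    return machines


def best_fit(demands: List[int], capacity: int) -> List[List[int]]:
    """Place each demand on the feasible machine left with the least free room."""
    machines: List[List[int]] = []
    for demand in demands:
        best: Optional[List[int]] = None
        best_free: Optional[int] = None
        for machine in machines:
            free_after = capacity - sum(machine) - demand
            if free_after >= 0 and (best_free is None or free_after < best_free):
                best, best_free = machine, free_after
        if best is None:
            machines.append([demand])
        else:
            best.append(demand)
    return machines


vm_demands = [4, 8, 1, 4, 2, 1]  # hypothetical CPU demands, machine capacity 10
print(first_fit(vm_demands, 10))
print(best_fit(vm_demands, 10))
```

Both heuristics process the demands in the order given, so they can be used online; sorting the demands in decreasing order first gives the usual offline variants.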
Objective
1. Find new characteristics of the trace and exhibit the main properties of its jobs.
2. Propose a set of very few parameters that account for the main characteristics of the trace.
Ultimately, this work aims at:
• Leveraging the design of efficient allocation algorithms
• Fostering the generation of realistic random traces
The Google Cluster Trace
186 GB of data
Detailed workload
• ∼700000 jobs
• Each job: multiple tasks on LXCs (Linux Containers)
12583 heterogeneous machines
• Each task assigned to a single physical machine
Exhaustive information
• Actual CPU and memory usage per task, execution time, job priorities...
• Collected during 29 days on 5-minute monitoring intervals
Priority groups
• Low-priority tasks can be evicted/migrated in favor of higher-priority ones
Important Questions
Designing efficient trace models raises several questions:
Profile | Model Premise | Questions
Static | Small set of parameters | Set of jobs representative of the whole trace usage?
Dynamics | Statistical prediction. Re-computation frequency. | Variation of jobs over time? Lifespan distribution? Variation in usage patterns?
Advanced Features | Multi-dimensional Bin-Packing problems simplified if dimensions correlated. | Correlation between jobs' dimensions (CPU, memory)?
Fault Tolerance | Quantification of qualities and limits of a model | Frequency of failures? Correlated or independent?
Dominant Jobs
Can we find a set of jobs representative of the whole trace?
• 3.8% of all jobs account for 94.7% of CPU and 90% of memory usage.
• Large number of tasks per job (470 on average).
• Most usage is concentrated in Normal Production jobs.
• Similar number of tasks in Normal Production (27%) and Gratis (21%) jobs.
[Figure: Average resource usage of dominant jobs, stacked by priority class.]
The rest of this work focuses on this set of Dominant Jobs
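The slides do not say how the dominant set was extracted. One plausible way to obtain such a set, shown here only as a sketch, is to sort jobs by usage and keep the shortest prefix that covers a target share of the total. The job names, usage values and the `dominant_jobs` helper are hypothetical.

```python
from typing import Dict, List


def dominant_jobs(usage: Dict[str, float], target_share: float) -> List[str]:
    """Return the heaviest jobs whose combined usage reaches target_share of the total.

    `usage` maps a job identifier to its aggregated resource usage
    (for example average CPU over the trace). Selection is greedy,
    heaviest job first.
    """
    total = sum(usage.values())
    selected: List[str] = []
    covered = 0.0
    for job, job_usage in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target_share * total:
            break
        selected.append(job)
        covered += job_usage
    return selected


# Hypothetical per-job CPU usage, not taken from the trace.
cpu_usage = {"a": 50.0, "b": 30.0, "c": 10.0, "d": 5.0, "e": 5.0}
print(dominant_jobs(cpu_usage, 0.85))
```

With two resources (CPU and memory), the same selection could be run on each dimension and the union of the two sets kept; that choice is also an assumption.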
Workload Characterization
How is the resource usage of dominant jobs distributed?
• Can it be easily modeled?
Are the resource usage dimensions correlated?
• Fewer parameters reduce the complexity of packing algorithms
Are Dominant Jobs stable over time? Do they exhibit any patterns?
• To estimate the stability w.r.t. their priority
• So that the resource usage at a given time can be predicted
Workload Characterization
How is the resource usage of jobs distributed?
• It can be modeled by a mixture of two lognormal distributions.
• Most of the usage comes from Normal Production jobs. Some jobs in Gratis and Other occasionally use more resources.
[Figure: Distribution of CPU usage. Density vs. log of CPU usage, one panel per priority class (Gratis, NormProduction, Other); data compared with Gaussian and bimodal fits.]
[Figure: Distribution of memory usage. Density vs. log of memory usage, one panel per priority class (Gratis, NormProduction, Other); data compared with Gaussian and bimodal fits.]
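A mixture of two lognormal distributions on the usage is the same thing as a mixture of two Gaussians on the log of the usage, which is what the plots show. The sketch below fits such a mixture with a plain EM loop. It is not the authors' fitting procedure: the initialisation, the iteration count and the synthetic sample are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_lognormals(
    usage: Sequence[float], iterations: int = 100
) -> Tuple[List[float], List[float], List[float]]:
    """EM for a two-component Gaussian mixture on log(usage).

    Returns (weights, means, standard deviations) in log space.
    """
    logs = [math.log(u) for u in usage]
    n = len(logs)
    # Crude initialisation: one component at each end of the data.
    mu = [min(logs), max(logs)]
    spread = (max(logs) - min(logs)) / 4.0 or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to component 0.
        resp = []
        for x in logs:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(p0 / (p0 + p1) if p0 + p1 > 0.0 else 0.5)
        # M-step: re-estimate weights, means and standard deviations.
        n0 = sum(resp)
        n1 = n - n0
        if n0 < 1e-9 or n1 < 1e-9:
            break  # one component has collapsed; keep the last estimate
        mu = [
            sum(r * x for r, x in zip(resp, logs)) / n0,
            sum((1.0 - r) * x for r, x in zip(resp, logs)) / n1,
        ]
        var0 = sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, logs)) / n0
        var1 = sum((1.0 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, logs)) / n1
        sigma = [max(math.sqrt(var0), 1e-6), max(math.sqrt(var1), 1e-6)]
        weight = [n0 / n, n1 / n]
    return weight, mu, sigma


# Synthetic usage values drawn from a known mixture (not trace data).
random.seed(0)
sample = [
    random.lognormvariate(0.0, 0.5) if random.random() < 0.6
    else random.lognormvariate(4.0, 0.5)
    for _ in range(2000)
]
print(fit_two_lognormals(sample))
```

The five fitted numbers (one weight, two means, two standard deviations) are the kind of small parameter set the talk is looking for to describe a priority class.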
Workload Characterization
For a given job, does the memory usage of tasks depend on their CPU usage?
1. Dominant Jobs sampled at 20 random timestamps per day
2. For each job, tasks clustered into groups according to their CPU-memory usage
3. Linear regression on each cluster of tasks
4. Well-fitted jobs classified into Flat (memory constant) or Slope (memory affine in CPU)
Figure labels: 67% | 6.5% | 26.5% Badly Fitted
[Figure: CPU (x-axis) vs. Memory (y-axis) of four jobs with different usage patterns. Each dot represents a task.]
Conclusion: Memory usage w.r.t. CPU can be modeled as affine or constant for most jobs (> 70%)
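Steps 3 and 4 can be sketched as follows for a single cluster of tasks. The clustering of step 2 is left out, and the two thresholds (`fit_tol`, `flat_tol`) as well as the use of a relative error instead of another goodness-of-fit measure are assumptions; the slides do not give the criteria actually used.

```python
import math
from typing import Sequence, Tuple


def fit_line(cpu: Sequence[float], mem: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit mem = slope * cpu + intercept; also returns the RMSE."""
    n = len(cpu)
    mean_cpu = sum(cpu) / n
    mean_mem = sum(mem) / n
    sxx = sum((x - mean_cpu) ** 2 for x in cpu)
    sxy = sum((x - mean_cpu) * (y - mean_mem) for x, y in zip(cpu, mem))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    intercept = mean_mem - slope * mean_cpu
    rmse = math.sqrt(
        sum((y - (slope * x + intercept)) ** 2 for x, y in zip(cpu, mem)) / n
    )
    return slope, intercept, rmse


def classify(cpu: Sequence[float], mem: Sequence[float],
             fit_tol: float = 0.1, flat_tol: float = 0.1) -> str:
    """Label one cluster of tasks as 'flat', 'slope' or 'badly fitted'.

    Assumes positive memory usage. Both tolerances are relative to the
    mean memory usage and are hypothetical values.
    """
    slope, _, rmse = fit_line(cpu, mem)
    mean_mem = sum(mem) / len(mem)
    if rmse > fit_tol * mean_mem:
        return "badly fitted"
    cpu_span = max(cpu) - min(cpu)
    # Flat: memory barely changes over the observed CPU range.
    if abs(slope) * cpu_span <= flat_tol * mean_mem:
        return "flat"
    return "slope"


print(classify([1, 2, 3, 4], [5, 5, 5, 5]))
print(classify([1, 2, 3, 4], [2, 4, 6, 8]))
print(classify([1, 2, 3, 4], [1, 9, 1, 9]))
```

A relative error is used rather than R² because a perfectly flat job has almost no memory variance, which makes R² uninformative for exactly the class of interest.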
Workload Characterization
Are Dominant Jobs stable over time?
Priority | Percentage | Duration
Gratis | 50% | < 25 minutes
Gratis | 1% | > 30 hours
Other | 50% | < 25 minutes
Other | 1% | > 15 hours
Normal Production | 50% | > 31.7 hours
Normal Production | 15.6% | whole trace
Workload Characterization
Do Dominant Jobs exhibit any patterns over time?
Correlation of resource usage over time among jobs is important for efficient job allocation
• Two positively correlated jobs (peak at the same time) can be allocated on different machines to avoid starvation
• Two negatively correlated jobs (peak at different times) can be packed together to achieve better resource utilization
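The two rules above can be illustrated with the Pearson correlation of two usage time series. This is only a sketch: the 0.5 threshold, the `placement_hint` helper and the sample series are hypothetical, and the slides do not state which correlation measure was used.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equally long usage time series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def placement_hint(a: Sequence[float], b: Sequence[float], threshold: float = 0.5) -> str:
    """Suggest whether two jobs should share a machine (threshold is an assumption)."""
    r = pearson(a, b)
    if r >= threshold:
        return "separate"   # peaks coincide: risk of starvation
    if r <= -threshold:
        return "co-locate"  # peaks alternate: better utilization
    return "no preference"


day_job = [1, 2, 5, 9, 5, 2]      # hypothetical usage samples
night_job = [9, 8, 5, 1, 5, 8]
print(placement_hint(day_job, [2 * x for x in day_job]))
print(placement_hint(day_job, night_job))
```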
We have performed an analysis of the periodicity of the resource usage of jobs
• The resource usage can be approximated by a periodic function
• Analysis of the main components of the spectrum
• Analysis restricted to the dominant Normal Production jobs (long enough)
• Harmonics removal
• Quantification of amplitude, phase, frequency and background noise
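To make the spectral analysis concrete, the sketch below extracts the strongest periodic component of a usage series with a direct discrete Fourier transform and reports its period, amplitude and phase lag. It does not cover harmonics removal or noise estimation, and the hourly synthetic signal is an assumption, not trace data.

```python
import cmath
import math
from typing import Sequence, Tuple


def dominant_component(series: Sequence[float]) -> Tuple[float, float, float]:
    """Return (period in samples, amplitude, phase lag in radians).

    The series is modeled around its strongest component as
    amplitude * cos(2*pi*t/period - phase_lag) plus everything else.
    Bin 0 (the mean) and the Nyquist bin are skipped.
    """
    n = len(series)
    best_k, best_coeff = 0, 0j
    for k in range(1, (n - 1) // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)
        )
        if abs(coeff) > abs(best_coeff):
            best_k, best_coeff = k, coeff
    if best_k == 0:
        raise ValueError("series has no periodic component to report")
    period = n / best_k
    amplitude = 2.0 * abs(best_coeff) / n
    phase_lag = -cmath.phase(best_coeff)
    return period, amplitude, phase_lag


# One week of hourly samples: mean 3, daily swing of 2, peak 4 hours into each day.
hours = range(24 * 7)
usage = [3.0 + 2.0 * math.cos(2.0 * math.pi * t / 24.0 - math.pi / 3.0) for t in hours]
period, amplitude, phase_lag = dominant_component(usage)
peak_offset = (phase_lag % (2.0 * math.pi)) / (2.0 * math.pi) * period
print(period, amplitude, math.degrees(phase_lag), peak_offset)
```

For a daily component, a phase lag of 60 degrees corresponds to 60/360 of 24 hours, that is 4 hours, which is how phase differences in degrees translate into hours.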
Workload Characterization
Do Dominant Jobs exhibit any patterns?
[Figure: CPU usage (core-sec/sec) per day of a Normal Production job with daily and weekly patterns.]
Patterns:
• >50% of jobs: strong daily patterns
• 67% of jobs: weekly patterns (5 days of high usage, 2 days of lower usage)
Phase difference:
• >50% of jobs: <60 degrees (4 hours) apart
• >90% of jobs: <120 degrees (8 hours) apart
Machine Failure Characterization
Can machine failures be modeled after a distribution?
Assumptions:
• Machines fail independently
• Failure probability is constant
Model:
• Poisson distribution P(λ)
• λ = average number of failures = 0.97
[Figure: Distribution of machine removal events. Number of time windows (log scale) vs. number of events (0 to 9), actual vs. expected values. Caption: Actual versus expected distribution of failures.]
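The "expected" values in such a comparison follow from the Poisson probability mass function, P(k) = e^(-λ) λ^k / k!. The sketch below computes them for λ = 0.97, reading λ as the mean number of failures per time window (an interpretation suggested by the plot axes). The number of windows (1000) is a hypothetical value, since the slides do not give it.

```python
import math
from typing import List


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_windows(n_windows: int, lam: float, max_events: int) -> List[float]:
    """Expected number of time windows containing 0..max_events failures."""
    return [n_windows * poisson_pmf(k, lam) for k in range(max_events + 1)]


LAMBDA = 0.97     # average number of failures, from the slide
N_WINDOWS = 1000  # hypothetical number of observation windows

for events, count in enumerate(expected_windows(N_WINDOWS, LAMBDA, 9)):
    print(events, round(count, 3))
```

Comparing these counts with the observed histogram, for instance window by window on a log scale as in the figure, is what shows how well the independence assumption holds.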
Work Outcome
Modeling the Workload
• How many parameters to include (realism vs. overfitting)?
• It depends on the system being modeled
Work Outcome
Allocation algorithms
1. Focus on jobs, not tasks. Examples:
  • Job load balanced among its tasks
  • Job allocation computed globally + greedy allocation for individual tasks
2. Describe jobs with their aggregated CPU and memory
3. Consider correlation among dimensions (CPU, memory)
4. Consider at least daily and weekly patterns
5. Machines can be assumed to have independent failures and a failure rate of 10^-5 per hour
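To give a feel for what a rate of 10^-5 per hour means, the sketch below works out two quantities under a constant-rate (exponential lifetime) model, which matches the independence assumption but is not spelled out on the slides. The fleet size and trace length are the figures quoted earlier in the talk; treating the fleet as constant (failed machines are replaced at once) is an additional assumption.

```python
import math

FAILURE_RATE_PER_HOUR = 1e-5  # from the slide
MACHINES = 12583              # machines in the trace
TRACE_HOURS = 29 * 24         # 29-day trace


def survival_probability(rate: float, hours: float) -> float:
    """Probability that one machine sees no failure during `hours`."""
    return math.exp(-rate * hours)


def expected_failures(machines: int, rate: float, hours: float) -> float:
    """Expected number of failures in a fleet kept at constant size."""
    return machines * rate * hours


print(survival_probability(FAILURE_RATE_PER_HOUR, TRACE_HOURS))
print(expected_failures(MACHINES, FAILURE_RATE_PER_HOUR, TRACE_HOURS))
```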
Future Work
• To propose a complete generating model of the identified parameters
• Characterization of machine failures over time
• Design and validation of efficient resource allocation algorithms
  • Example: Allocation of services with periodic resource usage by co-locating jobs with compatible peak times
Backup Slides
State transitions for jobs and tasks
State transitions for jobs and tasks.
Source: Google cluster-usage traces format + schema. Charles Reiss, John Wilkes, Joseph Hellerstein. Version of 2013.05.06, for trace version 2. Copyright ©2011 Google Inc. All rights reserved.
Trace Timeline
Mapping of original times to times emitted in the trace.
Source: Google cluster-usage traces format + schema. Charles Reiss, John Wilkes, Joseph Hellerstein. Version of 2013.05.06, for trace version 2. Copyright ©2011 Google Inc. All rights reserved.
Trace Utilization
Mapping of original times to times emitted in the trace.
Source: Towards understanding heterogeneous clouds at scale: Google trace analysis. Charles Reiss (UC Berkeley), Alexey Tumanov (CMU), Gregory R. Ganger (CMU), Randy H. Katz (UC Berkeley), Michael A. Kozuch (Intel Labs). Intel Science & Technology Center for Cloud Computing, Carnegie Mellon University.
Tasks per Job
[Figure: CDF of number of tasks. x-axis: # tasks (0 to 35000); y-axis: Jobs (0.0 to 1.0).]