Power-Aware Parallel Job Scheduling Maja Etinski Julita Corbalan Jesus Labarta Mateo Valero...
-
Upload
arabella-lynch -
Category
Documents
-
view
212 -
download
0
Transcript of Power-Aware Parallel Job Scheduling Maja Etinski Julita Corbalan Jesus Labarta Mateo Valero...
Power-Aware Parallel Job
Scheduling
Maja Etinski Julita Corbalan Jesus Labarta Mateo Valero
{maja.etinski,julita.corbalan,jesus.labarta,mateo.valero}@bsc.es
EEHiPC'10 2
Power Consumption of Supercomputing Systems
Striving for performance has led to enormous power dissipation of HPC centers (Top500 list)
KWatts
EEHiPC'10 3
Power reduction approaches in HPC
Power reduction approaches
Application level:
- Runtime systems: - exploit certain application characteristics (load imbalance, communication intensive regions)
- based on very fine grain DVFS application
System level:
- Turning off idle nodes: - resource allocation such that there are more completely idle nodes - determining number of online nodes
- Operating system power management via DVFS: - linux governors – per core, unawareness of the rest of the system
- DVFS taking into the account entire system workload?
EEHiPC'10 4
Parallel Job Scheduling
Job scheduler has a global view of the whole system
HPC Job Scheduler
Job submission
Job with its
requirements
Queued jobs
Job Scheduling
Resource Manager
Wait Queue
EEHiPC'10 5
DVFS and Job Scheduling
HPC Job Scheduler
Job submission
Job with its
requirements
Queued jobs
Job Scheduling
Resource
Manager
Wait Queue
Power-Aware
Component
Job CPU frequency assignment
based on goals/constraints
EEHiPC'10 6
Outline
Parallel job scheduling:
short introduction to parallel job scheduling
the EASY backfilling policy
Power and run time modelling:
first we need to understand how frequency scaling affects CPU power dissipation and runtime
Energy-saving parallel job scheduling policies:
Utilization-driven power-aware scheduling
[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Utilization driven power-aware parallel job scheduling. Energy Aware High Performace Computing Conference, Hamburg, September 2010]
BSLD-driven power-aware scheduling
[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Bsld threshold driven power management policy for hpc centers. IEEE Parallel and Distributed Processing Symposium, HPPAC Workshops 2010 Atlanta, GA, April 2010]
Power-budgeting:
how to maximize job performance under a given power budget?
[Maja Etinski, Julita Corbalan, Jesus Labarta, and Mateo Valero. Optimizing job performance under a given power
constraint in hpc centers. IEEE International Green Computing Conference, Chicago, IL, August 2010]
EEHiPC'10 7
About Parallel Job Scheduling
Parallel job scheduling can be seen as finding a free rectangle for the job being scheduled:
FCFS policy used in the beginning
Backfilling policies introduced to improve system utilization
Job performance metrics:
Response time: WaitTime(J)+RunTime(J)
Slowdown: (WaitTime(J) + RunTime(J))/RunTime(J)
Bounded Slowdown:
max((WaitTime(J)+Runtime(J))/max(Th,RunTime(J)) ,1)
Time
CPUs
Job 1
Job 2
Job 3
Job 5
Job 4 Job 6
Job Performance
Wait Time Run Time
EEHiPC'10 8
The EASY backfilling policy
Job 2
Job 3
Job 4
Job 1
Arrival of Job 5
Job 5
Job 5Job 5
Job 6
MakeJobReservation(Job5)
BackfillJob(Job6)
Arrival of Job 6
Time
CPUs
Jobs are executed in FCFS order except when the first job in the wait queue can not start Users have to submit an estimation of job's runtime – requested time When the first job in the WQ can not start, a reservation is made for it based on requested
times of running jobs A job is executed before previously arrived ones only if it does not delay the first job in the
wait queue
High-Level DVFS modelling
EEHiPC'10 10
Power ModelCPU power presents one of main system power components
It consists of dynamic and static power:
Pcpu
= Pdynamic
+ Pstatic
Pdynamic
= AcfV2 Pstatic
= α V
Fraction of static in total CPU power is a model parameter:
Pstatic
(Vtop
) = X(Pstatic
(Vtop
) + Pdynamic
(ftop
,Vtop
))
( X = 25% in our experiments )
Average activity factor assumed to be same for all jobs (2.5 times higher than idle activity)
Idle processors: do not consume power/ consume power at the lowest frequency
DVFS gear set :
f 0.80 1.10 1.40 1.70 2.00 2.30
V 1.00 1.10 1.20 1.30 1.40 1.50
Norm(P) 0.28 0.38 0.49 0.63 0.80 1.00
EEHiPC'10 11
Time ModelExecution time dependence on frequency is captured by the following
model:
F(f,ß)=T(f) / T(ftop) = ß(ftop / f -1) + 1
[Hsu,Feng SC05: A Power-Aware Run Time System for High-
Performance Computing]
ß is assumed to have the following normal distributions:
Global application ß depends on communication/computation ratio
Two ß scenarios:
ß is known in advance (at the moment of scheduling)
ß is not known in advance (at the moment of scheduling the worst
case, ß = 1, is assumed )
N(0.3, 0.064)More than 32
N(0.4, 0.01)Between 4 and 32
N(0.5, 0.01)Less or equal to 4
DistributionNumber of CPUs
0.8 1.1 1.4 1.7 2 2.3
0
0.5
1
1.5
2
2.5
Execution time penalty
due to frequency scaling
Frequency
Exe
cutio
n tim
e
ß=0.7ß=0.5ß=0.3
Energy Saving Parallel Job
Scheduling Policies
EEHiPC'10 13
Utilization-Driven Policy Frequency assigned once (at jobs start time) for entire job execution based on system utilization
Utilization is computed for each interval T:
An additional control over system load WQthreshold:
If there are more than WQthreshold jobs in the wait queue no frequency scaling will be applied
Otherwise, job started during interval Jk runs at frequency F
Uk-1
Fk
Ulower
Uupper
flower
fupper
ftop
EEHiPC'10 14
Evaluation
C++ event driven parallel job scheduling simulator has been upgraded
Policy parameters:
utilization thresholds: Ulower = 50% Uupper = 80%
reduced frequencies: flower = 1.4 GHz fupper = 2.0 GHz
utilization computation interval: T = 10 min
wait queue length threshold: WQthreshold = 0, 4, 16, NO - limit
Metric of job performance – Bounded Slowdown:
BSLD at frequency f :
Alvio simulator
Policy parameters
Metric of performance
EEHiPC'10 15
Workloads
Five workloads from production use have been simulated:
Workload - #CPUs Avg Util Avg LR
CTC - 430 70% 1.61 50% 28%
SDSC – 128 85% 8.17 26% 5%
SDSCBlue – 1152 69% 2.31 55% 26%
LLNLThunder – 4008 80% 80.00% 29% 11%
LLNLAtlas – 9216 75% 0.94 26% 19%
%T below Uupper % T below U lower
San Diego Supercomputing Center
-less sequential jobs than CTC-runtime distribution similar
San Diego Supercomputing Center
- no sequential job
Lawrence Livermore National Lab
- small to medium size jobs
LLNLAtlas – 9216 10 – 15 1.08 69LLNLAtlas – 9216 10 – 15 1.08 69LLNLAtlas – 9216 10 – 15 1.08 69LLNLAtlas – 9216 10 – 15 1.08 69
Cornell Theory Center-large jobs with relatively
low level of parallelism
Parallel workload archivehttp://www.cs.huji.ac.il/labs/parallel/workload
Lawrence Livermore National Lab
- large parallel jobs
EEHiPC'10 16
Results: Normalized CPU Energy
CTC SDSC SDSCBlue LLNLThunder LLNLAtlas
80%
82%
84%
86%
88%
90%
92%
94%
96%
98%
100%
WQ
0
4
16
NO
Workload
%E
(idle
=0)
CTC SDSC SDSCBlue LLNLThunder LLNLAtlas
80%
82%
84%
86%
88%
90%
92%
94%
96%
98%
100%
WQ
0
4
16
NO
Workload
%E
(idle
=lo
w)
short wait queues
very similar results for both energy scenarios
savings of not highly loaded workloads up to 12%
EEHiPC'10 17
Results: Normalized Performance
CTC SDSC SDSCBlue LLNLThunder LLNLAtlas
0.8
0.9
1
1.1
1.2
1.3
1.4
1.5
WQ0
4
16
NO
Workload
Nor
mal
ized
Ave
rage
BS
LD
CTC SDSC SDSCBlue LLNLThunder LLNLAtlas
0
500
1000
1500
2000
2500
WQ
0
4
16
NO
WorkloadN
umbe
r of
red
uced
jobs
an increase in number of backfilled jobs
high penalty in the least conservative case for highly loaded workload
WQ threshold has almost no impact
EEHiPC'10 18
Average frequency - SDSCBlue
EEHiPC'10 19
BSLD-Driven Policy
Frequency is assigned based on job's
predicted performance
Lower frequency -> longer execution time
-> worse job performance metric
BSLDth controls allowable performance
penalty (“target BSLD”)
In order to be run at lower frequency f a job
has to satisfy BSLD condition at frequency
f: if the job's predicted BSLD at frequency f
is lower than BSLDth than it satisfies the
BSLD condition at frequency f
Predicted BSLD:
WQsize
≤ WQthreshold
NO
YES
f = Flowest
find an allocation Alloc
satisfiesBSLD(Alloc,Ji,f) or
f=Ftop
Job Ji
Run job Ji at frequency f
YES
NOf = next higher frequency
Run Ji at F
top
EEHiPC'10 20
Results: Normalized CPU Energy
Normalized energies in two energy
scenarios behave in the same way
Average savings in the most
aggressive case: 5% - 23%
Difference in savings per workload for
the most conservative and the most
aggressive threshold combinations
goes from 5% (SDSC) to 15%
(LLNLThunder)
WQthreshold
controls DVFS
aggressiveness much better than
BSLDthreshold
BSLDthreshold
has stronger impact when
WQthreshold
is higher
EEHiPC'10 21
Average BSLD
4.66
24.91
5.15 1 1.08
Decrease in performance is proportional to energy savings
Strong impact on
performance in the most
aggressive case
Impact of WQthreshold
higher than of BSLDthreshold
BSLDthreshold
has
stronger impact when
WQthreshold
is higher
EEHiPC'10 22
Reduced jobs (out of 5000)
When load is very high (SDSC) no DVFS is applied
(in order to apply it thresholds have to be set to higher values)
Performance depends on
the number of reduced jobs
It depends on used
frequencies as well
It was remarked that
performance of jobs that
have been run at the
nominal frequency was
affected as well
EEHiPC'10 23
Wait time
Zoom of SDSCBlue wait time behavior
Main problem observed:
-> high impact on wait time
Power-Budgeting Policy
EEHiPC'10 25
PB-Guided Policy: How DVFS
can improve overall job performance
J1
J3
J2
Power
Budget
NO
DVFS
CASE
DVFS
CASE
J4
J5
ftop
Wait Queue:J
2
J1
J3
J4
J5
J1
J3
J2
J4
J5
ftop
flower
penalty in run time due to frequency scaling
but more jobs can run simultaneously
TimeT
1T
2
EEHiPC'10 26
Power Budgeting: PB-Guided PolicyFrequency assignment is guided by predicted job performance and current power draw
Prediction of BSLD when selecting frequency:
BSLD condition:
A job satisfies BSLD condition at reduced frequency f if its predicted BSLD
at the frequency f is lower than current value of the BSLD threshold
The policy is power conservative:
A job will be scheduled at the lowest frequency at which both BSLD condition and power limit
are satisfied
BSLD
threshold
0 Plower P
upper Power Budget
The closer to the PB limit, the
higher the BSLD threshold
The higher the BSLD threshold,
the lower frequency will be selectedPcurrent
EEHiPC'10 27
Power Budgeting: PB-Guided Policy
A job can be scheduled with one of the two functions:
MakeJobReservation(J)
1: scheduled <-- false;
2: shiftInTime <-- 0;
3: nextFinishJob <-- next(OrderedRunningQueue);
4: while( !scheduled)
{
5: f <-- FlowestReduced
6: while(f < Fnominal
)
{
7: Alloc = findAllocation(J,currentTime + shiftInTime,f);
8: if (satisfiesBSLD(Alloc, J, f) and satisfiesPowerLimit(Alloc, J, f) )
9: { schedule(J, Alloc);
10: scheduled <-- true;
11: break; }
12: if (f == Fnominal
)
13: Alloc = findAllocation(J,currentTime + shiftInTime, Fnominal
)
14: if (satisfiesPowerLimit(Alloc, J,Fnominal
))
15: schedule(J, Alloc);
16: break;
17: shiftInTime <-- FinishTime(nextFinishJob) - currentTime;
18: nextFinishJob <-- next(OrderedRunningQueue);
}
BackfillJob(J)
1: f <-- Flowest
2: while(f < Fnominal
)
{
3: Alloc = TryToFindBackfilledAllocation(J,f);
4: if (correct(Alloc) and satisfiesBSLD(Alloc, J,f) and
satisfiesPowerLimit(Alloc,J,f))
5: { schedule(J, Alloc);
6: break; }
7: f <-- nextHigherFrequency
}
8: if (f==Fnominal
)
9: { Alloc = TryToFindBackfilledAllocation(J,Fnominal
);
10: if ((correct(Alloc) and satisfiesPowerLimit(Alloc,J,f))
11: schedule(J, Alloc); }
Power Limit
BSLD Conditionthe lowest frequency that satisfies it will be selected
power budget must not be violated during entire job execution
28
EvaluationPolicy parameters:
Power budget thresholds: Plower = 0.6 , Pupper = 0.9BSLD threshold values which have been used:
BSLDlower = avg(BSLD) without power budgeting BSLDupper = 2* BSLDlower
Power budget set to 80% of the total CPU power consumed by whole system when running at Fnominal
Four workloads from production use have been simulated:
89%80%120 - 25LLNLThunder - 4008
74%69%5.1520 – 25SDSCBlue – 1152
95%85%24.9140 - 45SDSC – 128
72%70%4.6620 - 25CTC – 430
Over PBUtilizationAvg BSLDJobs(K)Workload - # CPUs
EEHiPC'10 29
Baseline Power Budgeting Policy
Power limited without DVFS:
No job will start if it would violate the budget although there are available processors
This case is equal to the EASY scheduling with a smaller machine
Job 2
Job 3
Job 1
Arrival of Job 4
Job 4
Job 4Job 4
Job 5
MakeJobReservation(Job5)
BackfillJob(Job6)
Arrival of Job 5CPUs
Job 6
Arrival of Job 6
Job 6
can not start because of power budget
EEHiPC'10 30
Results: Performance
Oracle case: it is assumed that ß values are known at the scheduling time
PB-guided policy shows better performance for all workloads!
AVG wait time decreases with DVFS under power constraint
EEHiPC'10 31
Results: Normalized CPU Energy (idle=0)
Oracle case: it is assumed that ß values are known at the scheduling time
EEHiPC'10 32
Utilization Over Time
EEHiPC'10 33
Power Budget Consumed
EEHiPC'10 34
Comparison of Unknown and Known ß
no ß ß no ß ß no ß ß no ß ß no ß ß
CTC 0.80 0.79 0.63 0.75 1.50 1.40 0.740 0.738 3924 3923
SDS 0.62 0.76 0.53 0.70 1.60 1.50 0.755 0.744 4103 3944
SDSCBlue 0.86 0.75 0.73 0.76 1.50 1.40 0.727 0.724 3611 3550
LLNLThunder 0.33 0.51 0.28 0.47 2.07 1.90 0.884 0.865 3830 3651
Energy Backfilled
Workload
Avg.BSLD Avg.WT Avg. Freq
Avg.BSLD, Avg.WT and Avg.Energy values are normalized with respect to corresponding baseline values
(EASY-backfilling with power limit and without DVFS)
EEHiPC'10 35
Conclusions
Energy – performance trade-off must be done carefully as DVFS does not affect only job
runtime but it can affect significantly job wait time and additionally decrease job performance Performance-energy trade-off needs to be done at job scheduling level as it affect jobs in the
wait queue and only the scheduler can estimate potential negative impact on queued jobs DVFS application to highly loaded workloads (SDSC) leads to very high performance penalty
Parallel job scheduling policies can be designed such that maximizes job performance
under a given power constraint It has been shown that DVFS can improve performance in power constrained HPC centers
(using lower CPU frequencies allows more job to run simultaneously) It is not necessary to know ß values in advance, moreover assuming the worst case at
scheduling time can give better performance than when they are known in advance
HPPAC 2010 Atlanta
EEHiPC'10 36
Thank you for your attention!