Measuring Quality of Service on Worker Node in Cluster
15/02/2006 CHEP 06 1
Rohitashva Sharma, R S Mundada, Sonika Sachdeva, P S Dhekne, Computer Division, BARC, Mumbai, India
Helge Meinhard, Tony Cass, Olof Barring, CERN, Geneva, Switzerland
INTRODUCTION
- Quality of Service
  - Defines goodness of a node for a type of task
  - Needed for better/optimum utilization of resources
- Computer Division, BARC and IT Division, CERN collaborated to explore ways to predict QoS
QoS – Definition
- QoS defines how well suited a node is for a given task
- QoS relates execution times as follows:

$$\mathrm{QoS} = \frac{T_{\mathrm{noload}}}{T_{\mathrm{execution}}}$$

- $T_{\mathrm{execution}}$ = wall clock execution time for any task
- $T_{\mathrm{noload}}$ = wall clock execution time of the task on a given node without load
- QoS varies between 0 and 1
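The definition above can be turned into a few lines of Python. This is only an illustrative sketch: the function name `qos` and the example timings are assumptions, not values from the study.

```python
def qos(t_noload: float, t_execution: float) -> float:
    """Return QoS = T_noload / T_execution for wall clock times in seconds."""
    if t_noload <= 0 or t_execution <= 0:
        raise ValueError("execution times must be positive")
    return t_noload / t_execution


# Hypothetical timings: 30 s on an idle node, 75 s on the same node under load.
print(qos(30.0, 75.0))
```

A task that runs as fast as it does on an idle node gets QoS = 1, and the value falls toward 0 as the loaded execution time grows.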
Methodology
- Three task categories
  - CPU intensive
  - Disk IO intensive
  - Network IO intensive
- Representative probe programs for each category
- Load generating program for each category
Methodology
- Monitor system metrics
  - Load average, CPU utilization, memory utilization, disk utilization, swap utilization, etc.
- Execute probe programs in different load conditions (generated using load generating programs)
- Correlate probe execution time, system metrics and no-load execution time of probe
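One simple way to carry out the correlation step is a least-squares line through (metric value, probe execution time) samples. The sketch below is an assumption about how this could be done, not the procedure used in the study; `linear_fit` is a hypothetical helper and the sample values are made up for illustration.

```python
def linear_fit(xs, ys):
    """Least-squares fit of ys = intercept + slope * xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical samples: system metric (e.g. load average) vs probe time in seconds.
metric = [0.0, 1.0, 2.0, 3.0]
exec_time = [20.0, 40.0, 60.0, 80.0]

slope, intercept = linear_fit(metric, exec_time)
# The intercept estimates the no-load execution time of the probe.
print(slope, intercept)
```

If the fitted intercept is close to the separately measured no-load execution time, the chosen metric is a reasonable predictor for that probe.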
Probe Selection
- Probe should
  - Represent real world applications
  - Have a short execution time
  - Be non-interactive
- Selected probes are
  - Linpack for CPU intensive
  - Bonnie for Disk IO intensive
  - Network IO intensive (not considered)
Load Generating programs
- Generate load in a given category
- Should have a long execution time
- Should provide a feature for varying the load
- Two types of Disk IO load
  - Block IO (IO in large data blocks)
  - Character IO (IO in small data blocks)
SETUP
- 32 node cluster
- Each node consists of
  - [email protected] GHz
  - 640 MB memory
  - 40 GB HDD
  - Red Hat Linux version 7.3
- EDG Fabric Monitoring System for gathering system metrics
CPU Probe
- CPU probe in different loading conditions
- Correlation using load average
  - Execution time varies linearly with load average
  - Problem in block IO load

$$\mathrm{QoS} = \frac{1}{1 + \mathrm{LoadAverage}} \qquad \text{(Equation 1)}$$
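Equation 1 and the QoS definition together give a predicted execution time of T_noload / QoS, which is linear in the load average. A minimal sketch, assuming the function names `qos_eq1` and `predicted_exec_time` and made-up input values:

```python
def qos_eq1(load_average: float) -> float:
    """Equation 1: QoS = 1 / (1 + LoadAverage)."""
    return 1.0 / (1.0 + load_average)


def predicted_exec_time(t_noload: float, qos: float) -> float:
    """Invert the QoS definition: T_execution = T_noload / QoS."""
    return t_noload / qos


# Hypothetical node: 1 minute load average of 1.5, probe takes 30 s when idle.
q = qos_eq1(1.5)
print(q, predicted_exec_time(30.0, q))
```

With these inputs the predicted time is 30 s x (1 + 1.5), which is the linear relationship between execution time and load average noted above.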
CPU Probe
[Figure: Execution Time vs 1 Min. Load Average. X axis: 1 min. Load Average x 100. Series: CPU Load, Char IO Load, Block IO Load.]
CPU Probe
- Load average represents combined CPU and IO load
- CPU probe depends only on CPU load
- Two ways to account for CPU load only
  - Average CPU load (VmStatR)
  - Calculate available CPU to probe
CPU Probe
- Average CPU load
  - 1 minute running average of run queue
  - Called VmStatR
- Predicted QoS will be

$$\mathrm{QoS} = \frac{1}{1 + \mathrm{VmStatR}} \qquad \text{(Equation 2)}$$
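The slides do not say how the running average is computed, so the following is only a sketch under stated assumptions: the run queue length (for example the `r` column of `vmstat`) is sampled every 5 seconds and averaged with a simple sliding window of 60 seconds. The class name `RunQueueAverage` and the sample values are hypothetical.

```python
from collections import deque


class RunQueueAverage:
    """Sliding-window mean of run queue samples (assumed form of VmStatR)."""

    def __init__(self, window_s: int = 60, interval_s: int = 5):
        # Keep only as many samples as fit in the window.
        self.samples = deque(maxlen=window_s // interval_s)

    def add(self, run_queue_length: float) -> None:
        self.samples.append(run_queue_length)

    def value(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def qos_eq2(vmstat_r: float) -> float:
    """Equation 2: QoS = 1 / (1 + VmStatR)."""
    return 1.0 / (1.0 + vmstat_r)


avg = RunQueueAverage()
for r in [3] * 12:  # hypothetical: three runnable processes for a full minute
    avg.add(r)
print(avg.value(), qos_eq2(avg.value()))
```

Because the window holds only the last minute of samples, older load drops out of the average and does not influence the predicted QoS.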
CPU Probe
[Figure: Execution Time vs VmStatR. X axis: VmStatR x 100. Series: CPU Load, Char IO Load, Block IO Load.]
CPU Probe
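- Available CPU to probe
  - Calculate using CPU utilization metric
- Probe is eligible for
  - Available idle time
  - A share of system and user time

$$\mathrm{QoS} = \frac{1}{100}\left(\mathrm{IdleTime} + \frac{\mathrm{UserTime}}{1 + \mathrm{VmStatR}} + \frac{\mathrm{SystemTime}}{1 + \mathrm{VmStatR}}\right) \qquad \text{(Equation 3)}$$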
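A small sketch of the available-CPU calculation, assuming that idle, user and system time are CPU utilization percentages that sum to 100 and that the probe receives all idle time plus a 1 / (1 + VmStatR) share of the busy time. The function name `qos_eq3` and the input values are illustrative only.

```python
def qos_eq3(idle_pct: float, user_pct: float, system_pct: float,
            vmstat_r: float) -> float:
    """Fraction of the CPU available to the probe, from utilization percentages."""
    share = 1.0 / (1.0 + vmstat_r)  # probe's share of the user and system time
    return (idle_pct + user_pct * share + system_pct * share) / 100.0


# Hypothetical node: 10% idle, 60% user, 30% system, one competing runnable task.
print(qos_eq3(10.0, 60.0, 30.0, 1.0))
```

On a fully idle node (100% idle time) the function returns 1, and on a fully busy node it reduces to the 1 / (1 + VmStatR) form of Equation 2.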
CPU Probe
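- The table compares QoS predicted using Equations 1 and 3 under block IO load
- QoS using Equation 3 shows the correct characteristic: it stays roughly constant (about 0.42 to 0.46), in line with the nearly constant execution time (30 to 32 s), whereas QoS using Equation 1 falls from 0.24 to 0.09

| QoS using Equation 1 | QoS using Equation 3 | Execution Time (Sec) |
|---|---|---|
| 0.2433 | 0.4300487 | 32 |
| 0.1605 | 0.4375441 | 31 |
| 0.1329 | 0.4624468 | 32 |
| 0.1136 | 0.415 | 30 |
| 0.1042 | 0.4536079 | 31 |
| 0.0952 | 0.4290476 | 30 |
| 0.0869 | 0.4430435 | 31 |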
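A quick consistency check on the table: by the QoS definition, QoS x execution time should equal the no-load execution time, which is a constant for a given probe and node. The sketch below applies that check to the table values; the helper names are hypothetical, and the no-load time itself is not given in the slides.

```python
# (QoS from Equation 1, QoS from Equation 3, execution time in seconds),
# copied from the block IO load table.
rows = [
    (0.2433, 0.4300487, 32),
    (0.1605, 0.4375441, 31),
    (0.1329, 0.4624468, 32),
    (0.1136, 0.415, 30),
    (0.1042, 0.4536079, 31),
    (0.0952, 0.4290476, 30),
    (0.0869, 0.4430435, 31),
]


def implied_noload_times(qos_values, exec_times):
    """T_noload implied by each row: QoS * T_execution."""
    return [q * t for q, t in zip(qos_values, exec_times)]


def spread(values):
    """Ratio of largest to smallest value; 1.0 means perfectly consistent."""
    return max(values) / min(values)


exec_times = [r[2] for r in rows]
eq1_noload = implied_noload_times([r[0] for r in rows], exec_times)
eq3_noload = implied_noload_times([r[1] for r in rows], exec_times)

print(f"Equation 1: {min(eq1_noload):.2f} to {max(eq1_noload):.2f} s")
print(f"Equation 3: {min(eq3_noload):.2f} to {max(eq3_noload):.2f} s")
```

The no-load time implied by Equation 3 stays within a narrow band (roughly 12.5 to 14.8 s), while the one implied by Equation 1 drifts from about 7.8 s down to about 2.7 s, which is another way to see that Equation 1 mispredicts under block IO load.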
Comparison of results
- Compare the QoS results obtained using the three equations for CPU probe in different loads
  - Equation 1 does not give correct prediction in block IO load conditions
  - Equations 2 & 3 give acceptable results in any load condition
CPU Probe – Comparison of results
[Figure: Comparison of the Measured and Predicted Exec Time for CPU Probe. Load conditions: LC+LB, LC, LC+LCh, LCh+LB. Series: Measured Exec Time, Equation 1, Equation 2, Equation 3.]
Legend: LC = CPU Load; LC+LB = CPU + Block IO Load; LC+LCh = CPU + Character IO Load; LCh+LB = Character + Block IO Load
Disk IO Probe
- Modified 'Bonnie' to act as both a block IO and a character IO probe
- Considered the block IO probe, as most of the applications were block IO intensive
- Correlated execution time of the probe under different loading conditions
- Predicted QoS using the three equations and compared results
Disk IO Probe – Comparison of results
[Figure: Comparison of Measured and Predicted Execution Time of Block IO Probe. Load conditions: LC+LB, LC, LC+LCh, LCh+LB. Series: Measured Exec Time, Equation 1, Equation 2, Equation 3.]
Legend: LC = CPU Load; LC+LB = CPU + Block IO Load; LC+LCh = CPU + Character IO Load; LCh+LB = Character + Block IO Load
CMSIM Results
- Predicted execution time using QoS from Equation 2
- Percentage error against the measured execution time is acceptable (under 4.5% in all cases)

| Measured Execution Time (Sec) | Predicted Execution Time (Sec) | % Error |
|---|---|---|
| 585 | 610.8687 | 4.422 |
| 739 | 744.3209 | 0.720017 |
| 929 | 934.466 | 0.588377 |
| 1082 | 1080.702 | -0.11999 |
| 1230 | 1216.43 | -1.10328 |
| 1413 | 1381.166 | -2.25294 |
| 1687 | 1707.317 | 1.204332 |
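The % Error column is consistent with (predicted - measured) / measured x 100. The sketch below recomputes it from the measured and predicted times in the table; the function name `percent_error` is an assumption.

```python
# (measured, predicted) execution times in seconds, copied from the CMSIM table.
rows = [
    (585, 610.8687),
    (739, 744.3209),
    (929, 934.466),
    (1082, 1080.702),
    (1230, 1216.43),
    (1413, 1381.166),
    (1687, 1707.317),
]


def percent_error(measured: float, predicted: float) -> float:
    """Signed error of the prediction relative to the measurement, in percent."""
    return (predicted - measured) / measured * 100.0


errors = [percent_error(m, p) for m, p in rows]
for (m, p), e in zip(rows, errors):
    print(f"{m:6d} s measured, {p:10.4f} s predicted, {e:+.3f} %")
```

A positive value means the prediction overestimates the execution time, and a negative value means it underestimates it.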
Problem Areas
- Effect of swapping
  - If available memory is less than the size of task
  - Linux kernel dynamically changes the priorities of tasks and swaps tasks accordingly
  - Difficult to predict QoS
Problem Areas – Swapping
[Figure: Variation of Memory, Swap and Exec Time. X axis: Sample No. (1 to 14). Series: % Used Memory, % Used Swap, Exec Time.]
Problem Areas
- Metric sampling frequency of monitoring system
  - Immediate metric value ensures better QoS prediction
  - At higher sampling frequency, monitoring itself loads the node
- Change in state after submission of task
  - QoS can't consider load changes after submission of task
  - Submission/removal of other tasks may change QoS
Conclusion
- Equations 2 & 3 provide better QoS prediction for CPU bound applications
- Equation 1 can be used for IO bound applications
- Successfully predicted execution time for CMSIM, which is a mostly CPU bound job
- Load balancing programs can use the derived equations for job submission
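As an illustration of the last point, a load balancer could rank candidate nodes by predicted QoS and submit a CPU bound job to the best one. This is a hypothetical sketch, not part of the work presented: the node names, the VmStatR values and the `best_node` helper are all made up.

```python
def qos_eq2(vmstat_r: float) -> float:
    """Equation 2: QoS = 1 / (1 + VmStatR)."""
    return 1.0 / (1.0 + vmstat_r)


def best_node(vmstat_r_by_node: dict) -> str:
    """Return the node name with the highest predicted QoS."""
    return max(vmstat_r_by_node, key=lambda node: qos_eq2(vmstat_r_by_node[node]))


# Hypothetical latest VmStatR readings from the monitoring system.
readings = {"node01": 2.4, "node02": 0.3, "node03": 1.1}

target = best_node(readings)
print(target, round(qos_eq2(readings[target]), 3))
```

As the Problem Areas slides note, such a choice reflects the node state only at submission time; load that arrives afterwards is not accounted for.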
Thanks