1© Bull, 2012
MEW23 Liverpool 27th‐28th November 2012
Dr. Dan Kidger
HPC Technical Consultant
Measuring and Optimising Energy Consumption of Batch jobs under BULL SLURM
Industrial Electricity Prices in Europe
Source Eurostat – year 2010
[Figure: Industrial electricity prices in the EU, 1998 to 2010, in €/kWh (y-axis 0.00 to 0.12). Countries shown: Germany, Spain, France, Italy, Netherlands, Poland, United Kingdom, Norway.]
http://epp.eurostat.ec.europa.eu/portal/page/portal/energy/data/main_tables#
Electricity prices highly variable across Europe: average 0.11 €/kWh
Electricity prices rising steadily
CAGR 12%
Bull: European leader in mission‐critical digital systems
9,000 EXPERTS recognized worldwide in secure systems
OPERATING IN 50 COUNTRIES
€1.3bn REVENUES
+29% growth in profitability in 1st quarter 2012
+4.6% growth in 2011
+23% research effort in 2011
TERA 100 in figures
• 1.25 PetaFlops
• 140 000+ Xeon cores
• 256 TB memory
• 30 PB disk storage
• 500 GB/s IO throughput
• 580 m² footprint
CURIE in figures
• 2 PetaFlops
• 90 000+ Xeon cores
• 148 000 GPU cores
• 360 TB memory
• 10 PB disk storage
• 250 GB/s IO throughput
• 200 m² footprint
IFERC in figures
• 1.5 PetaFlops
• 70 000+ Xeon cores
• 280 TB memory
• 15 PB disk storage
• 120 GB/s IO throughput
• 200 m² footprint
Bull in the Top500
18 systems in the Nov’12 Top500 list : 3 systems above 1 Pflops
Atomic Weapons Establishment
AWE confirms its trust in Bull with the upgrade of its 3 bullx supercomputers
New blades in the existing infrastructure
Simple replacement of the initial blades with new bullx B510 blades featuring the latest Sandy Bridge EP CPUs
• Willow 2x 35 Tflops
• Whitebeam 2x 156 Tflops
• Blackthorn 145 Tflops
• Sycamore 398 Tflops
All existing bullx chassis re‐used to house the new blades
Upgrade of the storage systems
Cluster software upgraded to bullx supercomputer suite 4
3 systems in the top500: Blackthorn, WillowA and WillowB
This innovative engineering company specializing in design for the motor racing industry wanted to:
Support the use of advanced virtual engineering technologies, developed in-house, for complete simulated vehicle design, development and testing
Solution
• 198 bullx B500 compute blades
• 2 memory rich bullx S6010 compute nodes for pre and post meshing
bullx supercomputer suite: key values
bullx MC
• Super‐Fast image based provisioning
• Web‐based Multi‐level supervision
• Power management
• Automated health management
• Maintenance management
bullx PFS
• Highly available cells based architecture
• Increased throughput and scalability
bullx BM
• Advanced placement policies
• Topology aware resource allocation
bullx MPI
• Multi‐path network failover
• Abnormal patterns detection
• Topology aware operations
bullx DE
• Complete best of breed set of tools (from compiling, debugging to profiling and optimizing activities)
bullx Linux
• HPC Enabled (OS jitter reduction, Optimized operations for increased application performance)
• Enhanced OFED
Ksis
Lustre
Slurm
OpenMPI
About Slurm
• Originally intended as a simple resource manager, but has evolved into a sophisticated batch scheduler
• Simple and small enough for use by Intel for their 48‐core “cluster on a chip”
• Able to satisfy scheduling requirements for major computer centers with use of optional plugins
• No single point of failure, backup daemons, fault‐tolerant job options
• Highly scalable (1.6M core Bluegene/Q installation at LLNL)
• Highly portable (autoconf, extensive plugins for various environments)
• Open source (GPL v2)
• Operating on many of the world's largest computers
• About 500,000 lines of code today (plus test suite and documentation)
Power Management with Slurm
Existing energy saving mechanism in SLURM
• System side feature
• Framework for energy saving through unutilized nodes
– Administrator configurable actions (hibernate, frequency scaling, power off, etc.)
– Automatic “Wake up” when jobs arrive
What can we do?
Make energy saving a user concern:
• Monitor and report node and job energy consumption
• Control over the job's energy usage
Task 1: Measuring Energy Consumption
Framework to support the capturing of power/energy consumption from the computing nodes
• Captures and reports the per node power/energy consumption
• Calculates the per step (job) energy consumption and stores it in the database along with the other execution characteristics
RAPL – power/energy measurements
• RAPL = “Running Average Power Limit”
• Available in Intel Sandy Bridge processors onwards
• Hardware registers for cumulative energy consumption
• PP0_ENERGY: energy used by “power plane 0”, which includes all cores and caches of a socket
• PP1_ENERGY: energy used by the “uncores” (this may include on-chip Intel GPU)
• PACKAGE_ENERGY: total energy consumed by entire package (PP0 + PP1)
• DRAM_ENERGY: energy drawn by the memory controller inside the processor chip (the actual power fed into the main memory DIMMs is not included in the current measurement)
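Because the RAPL registers hold cumulative energy, a power figure has to be derived from the difference between two readings. The following is a minimal sketch of that arithmetic, not the SLURM plugin's code. It assumes a 32-bit counter that wraps around and an energy unit of 1/2^16 J per count; the real unit must be read from the processor, so both constants and all function names here are illustrative assumptions.

```python
# Sketch only: counter width and energy unit are assumptions, not values
# taken from the slides. On real hardware the unit is read from the CPU.
COUNTER_BITS = 32               # assumed width of the cumulative counter
DEFAULT_ENERGY_UNIT_BITS = 16   # assumed: one count = 1 / 2**16 joule


def rapl_energy_joules(prev_raw, curr_raw,
                       energy_unit_bits=DEFAULT_ENERGY_UNIT_BITS):
    """Energy in joules consumed between two raw counter readings.

    The modulo handles a counter that wrapped around once between
    the two readings.
    """
    modulus = 1 << COUNTER_BITS
    delta_counts = (curr_raw - prev_raw) % modulus
    return delta_counts / (1 << energy_unit_bits)


def average_power_watts(prev_raw, curr_raw, interval_s,
                        energy_unit_bits=DEFAULT_ENERGY_UNIT_BITS):
    """Average power in watts over a sampling interval in seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    joules = rapl_energy_joules(prev_raw, curr_raw, energy_unit_bits)
    return joules / interval_s


# Hypothetical readings taken 30 seconds apart.
print(average_power_watts(1_000_000, 66_000_000, 30.0))
```

Sampling must be frequent enough that the counter cannot wrap more than once between readings, otherwise the difference is ambiguous.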
SLURM configuration
Easy configuration through configuration file
#slurm.conf
...
AcctGatherEnergyType=acct_gather_energy/rapl
AcctGatherNodeFreq=30
Power measurements are reported through scontrol
scontrol show node
NodeName=berlin47 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=32 Features=(null)
   Gres=(null)
   NodeAddr=berlin47 NodeHostName=berlin47
   OS=Linux RealMemory=1 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2012-09-28T11:04:32 SlurmdStartTime=2012-10-08T10:59:45
   CurrentWatts=33 LowestJoules=106789 ConsumedJoules=1652356
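The CurrentWatts and ConsumedJoules fields in the scontrol output above can be picked out with a few lines of scripting. This is a minimal sketch that assumes the output is a sequence of whitespace-separated KEY=VALUE tokens, as in the sample; the function name is hypothetical and values containing spaces are not handled.

```python
def parse_scontrol_node(text):
    """Return a dict of the KEY=VALUE tokens found in scontrol node output.

    Assumption: tokens are separated by whitespace and values contain no
    spaces, which holds for the sample from the slide.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip anything that is not KEY=VALUE
            fields[key] = value
    return fields


# Sample taken from the slide.
SAMPLE = """\
NodeName=berlin47 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=0 CPUErr=0 CPUTot=32 Features=(null)
   Gres=(null)
   NodeAddr=berlin47 NodeHostName=berlin47
   OS=Linux RealMemory=1 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2012-09-28T11:04:32 SlurmdStartTime=2012-10-08T10:59:45
   CurrentWatts=33 LowestJoules=106789 ConsumedJoules=1652356
"""

node = parse_scontrol_node(SAMPLE)
current_watts = int(node["CurrentWatts"])
consumed_joules = int(node["ConsumedJoules"])
print(node["NodeName"], current_watts, "W", consumed_joules, "J")
```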
Task 2: Controlling Energy Consumption
Measuring energy consumption and reporting it at the step/job level is an important step, but users should also have the means to influence it. So we introduced the --cpu-freq parameter in srun.
The user may request either a particular value in kilohertz or use low/medium/high, and the request will be matched to the closest possible numerical value.
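The "closest possible value" matching can be illustrated as follows. This is a sketch of the idea, not SLURM's implementation. The list of available frequencies is hypothetical, and the mapping of low/medium/high to the lowest, middle and highest entry, as well as breaking ties towards the lower frequency, are assumptions made for the example.

```python
def resolve_cpu_freq(request, available_khz):
    """Map a --cpu-freq style request to one of the available frequencies.

    request: an integer (or numeric string) in kHz, or one of
             "low", "medium", "high".
    Assumptions: keywords map to the lowest, middle and highest available
    frequency; ties between two equally close values go to the lower one.
    """
    freqs = sorted(set(available_khz))
    if not freqs:
        raise ValueError("no available frequencies")
    keywords = {
        "low": freqs[0],
        "medium": freqs[len(freqs) // 2],
        "high": freqs[-1],
    }
    if isinstance(request, str) and request.lower() in keywords:
        return keywords[request.lower()]
    target_khz = int(request)
    return min(freqs, key=lambda f: (abs(f - target_khz), f))


# Hypothetical frequency table for one node, in kHz.
EXAMPLE_FREQS_KHZ = [1200000, 1400000, 1800000, 2000000,
                     2200000, 2400000, 2700000]

print(resolve_cpu_freq(2500000, EXAMPLE_FREQS_KHZ))
print(resolve_cpu_freq("medium", EXAMPLE_FREQS_KHZ))
```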
Static Frequency Scaling with SLURM jobs
$# srun --cpu-freq=2700000 --resv-ports -N2 -n64 ./cg.C.64 &
$# sacct -j 58 --format=jobid,elapsed,aveCPUFreq,consumedenergy
       JobID    Elapsed AveCPUFreq ConsumedEnergy
------------ ---------- ---------- --------------
          66   00:00:49    2640340          19668
Effective CPU Frequency
Job Power consumption
Case Study: Conjugate Gradient Solver
Average CPU Frequency (kHz)   Elapsed Time   Consumed Energy (J)
1200000                       00:01:35       19366
1396460                       00:01:23       19018
1780477                       00:01:09       19353
1996186                       00:01:05       19817
2200000                       00:01:02       20494
2362500                       00:00:59       21408
2653125                       00:00:56       23125
[Figure: Ratio Time / Energy for each average CPU frequency from 1200000 to 2653125 (y-axis 0 to 1)]
$# srun --cpu-freq=2700000 --resv-ports -N2 -n64 ./cg.C.64
[Figure: wallclock (s), y-axis 0 to 100, x-axis 1.00 to 3.00]
[Figure: Joules, y-axis 15000 to 24000, x-axis 1.00 to 3.00]
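The trade-off in the case-study measurements can be checked with a short calculation. The sketch below uses only the seven (frequency, elapsed time, consumed energy) rows from the case study and derives the average power of each run, the run with the lowest energy and the fastest run. The variable and function names are illustrative, and average power is computed here as energy divided by elapsed time, which is a derived figure rather than something reported on the slides.

```python
# Rows from the case study: (average CPU frequency in kHz,
# elapsed time as HH:MM:SS, consumed energy in joules).
RUNS = [
    (1200000, "00:01:35", 19366),
    (1396460, "00:01:23", 19018),
    (1780477, "00:01:09", 19353),
    (1996186, "00:01:05", 19817),
    (2200000, "00:01:02", 20494),
    (2362500, "00:00:59", 21408),
    (2653125, "00:00:56", 23125),
]


def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarize(runs):
    """Add elapsed seconds and average power (J/s) to each run."""
    rows = []
    for freq_khz, elapsed, joules in runs:
        seconds = to_seconds(elapsed)
        rows.append({
            "freq_khz": freq_khz,
            "seconds": seconds,
            "joules": joules,
            "avg_watts": joules / seconds,
        })
    return rows


rows = summarize(RUNS)
lowest_energy = min(rows, key=lambda r: r["joules"])
fastest = min(rows, key=lambda r: r["seconds"])

for r in rows:
    print(f'{r["freq_khz"]:>8} kHz  {r["seconds"]:>3} s  '
          f'{r["joules"]:>6} J  {r["avg_watts"]:6.1f} W')
print("lowest energy at", lowest_energy["freq_khz"], "kHz")
print("fastest run at", fastest["freq_khz"], "kHz")
```

In these measurements the lowest-energy run is neither the slowest nor the fastest, which is why the choice of frequency is worth exposing to the user.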
SLURM Project Team
Research and Development:
‐ Dan Rusak (Bull, USA)
‐ Don Albert (Bull, USA)
‐ Martin Perry (Bull, USA)
‐ Yiannis Georgiou (Bull, France)
‐ Xavier Bru (Bull, France)
Design and Integration:
‐ Nancy Kritkausky (Bull, France)
‐ Moe Jette (SchedMD, USA)
‐ Danny Auble (SchedMD, USA)
Research and Design Ideas:
‐ Matthieu Hautreux (CEA, France)