PowerAPI usage proposal v1 (transcript)
Telemetry Data – In Band vs Out of Band

IB pathway:
– Tighter workload coupling
– Performance concerns, jitter concerns
– Data available: detailed profiling data such as component power, temperature, utilization

OOB pathway:
– Measure without affecting system performance
– Time frame is more coarse (seconds vs. ms)
– Difficult to correlate to sections of system jobs

OOB data collection methods:
– IPMI (polled): component power/temperature/fans via sensors; node power and inlet temperature
– REST/Redfish (polled): similar sensor data
– AMESTER (polled): detailed tracing from the OCC
– crassd (pushed via web socket): component temperatures and powers pushed every X seconds; alerts with trip limits

[Diagram: node with CPU, OCC, GPUs, main memory, BMC, host OS, and power measurement hardware; the IB path goes through the host OS, the OOB path through the BMC]
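To make the polled OOB methods above concrete, the following is a minimal sketch, not from the slides, of polling chassis power and thermal readings over Redfish with Python's requests library. The BMC address, credentials, and the chassis ID "chassis0" are placeholders, and individual BMCs may expose these readings under slightly different resource paths.

```python
# Minimal sketch of polled out-of-band collection via Redfish.
# Assumptions: BMC address/credentials and the chassis ID ("chassis0")
# are placeholders; verify the actual resource paths on your BMC.
import time
import requests

BMC = "https://bmc.example.com"
AUTH = ("admin", "password")          # placeholder credentials

def poll_once(session):
    # Standard Redfish Power and Thermal chassis resources.
    power = session.get(f"{BMC}/redfish/v1/Chassis/chassis0/Power",
                        verify=False).json()
    thermal = session.get(f"{BMC}/redfish/v1/Chassis/chassis0/Thermal",
                          verify=False).json()
    watts = [pc.get("PowerConsumedWatts") for pc in power.get("PowerControl", [])]
    temps = {t.get("Name"): t.get("ReadingCelsius")
             for t in thermal.get("Temperatures", [])}
    return watts, temps

with requests.Session() as s:
    s.auth = AUTH
    while True:
        watts, temps = poll_once(s)
        print(time.time(), watts, temps)
        time.sleep(5)                  # coarse, seconds-scale polling (OOB)
```

The seconds-scale sleep reflects the slide's point that OOB polling is much coarser than the in-band millisecond path.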
Measurement – In Band Sensors
[Diagram: node with CPU/OCC, GPUs, main memory, BMC/FSP, and host OS]

1. 40+ different types of sensor readings: power, performance, utilization
2. 400+ total sensors (24 cores scale this up)
3. Timestamped with the system timestamp (will be used by the GPUs as well)
4. Accumulator and update-tag support for energy calculations
5. Min/max support, and clearing of min/max
6. 4 KB pushed up every 10 ms
7. All sensors updated every 100 ms
8. Read by the OS, CSM, and the job profiler
9. Available through the standard Linux lm_sensors utility
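Because the in-band sensors surface through the standard lm_sensors path, the OS or a job profiler can read them directly from the hwmon sysfs tree. The sketch below is not from the slides; it simply walks /sys/class/hwmon and prints temperature, power, and energy readings using the kernel's documented unit conventions.

```python
# Minimal sketch of in-band collection via the hwmon sysfs interface
# that lm_sensors reads. Unit conventions follow the kernel hwmon ABI:
# temp*_input in millidegrees C, power*_input in microwatts,
# energy*_input in microjoules.
import glob
import os

def read(path):
    with open(path) as f:
        return f.read().strip()

for hwmon in sorted(glob.glob("/sys/class/hwmon/hwmon*")):
    name = read(os.path.join(hwmon, "name"))
    for f in sorted(glob.glob(os.path.join(hwmon, "*_input"))):
        label = os.path.basename(f)
        raw = int(read(f))
        if label.startswith("temp"):
            print(f"{name} {label}: {raw / 1000:.1f} C")
        elif label.startswith("power"):
            print(f"{name} {label}: {raw / 1e6:.1f} W")
        elif label.startswith("energy"):
            print(f"{name} {label}: {raw / 1e6:.1f} J")
```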
AMESTER: Automated Measurement of Systems for Energy and Temperature Reporting – Out of Band
§ Sensor data collection
– Whole-system power measurement
– Component power on POWER: processor Vdd, memory, IO, storage, fans
– Core: temperature, frequency, utilization, memory bandwidth, instructions per second
– Histograms
§ 500 µs tracing into an 8 KB buffer – e.g. 1 sensor for 1 second
§ 2 ms tracing into an 8 KB buffer – e.g. 1 sensor for 8 seconds
§ AMESTER is a Tcl interpreter
– All functions of AMESTER can be scripted via Tcl scripts
– Data library to collect and graph user data
– Output to CSV file for import into Excel or Matlab
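The trace durations follow from the fixed 8 KB buffer: duration is roughly (buffer size / bytes per sample) times the sampling period. The snippet below is only a back-of-the-envelope check under assumed per-sample sizes of a few bytes; it is not an AMESTER API, and AMESTER's actual record size may differ.

```python
# Back-of-the-envelope check of the 8 KB trace-buffer durations.
# bytes_per_sample values are assumptions for illustration only.
BUFFER_BYTES = 8 * 1024

def trace_seconds(period_s, bytes_per_sample):
    samples = BUFFER_BYTES // bytes_per_sample
    return samples * period_s

print(trace_seconds(500e-6, 4))   # ~1 s at 500 us/sample, 4 B/sample
print(trace_seconds(2e-3, 2))     # ~8 s at 2 ms/sample,   2 B/sample
```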
[Diagram: 300+ compute nodes, each with a BMC, CPU/OCC, GPUs, main memory, and host OS, reporting to a service node that runs the Cluster RAS Service Daemon (ibm-crassd) and CSM]
CSM path:
1. CSM collects sensor data every 10 minutes from the OS via a shared-memory read
2. Reads CPU, GPU, and memory temperatures and throttle status
3. Reads CPU and GPU energy consumption
4. CSM puts the sensor data into the BDS
ibm-crassd path (see the sketch below):
1. ibm-crassd uses web sockets
2. The BMC pushes sensors every 1 s
3. Only changed sensors are collected, to reduce bandwidth
4. CPU temps, DIMM temps, fan power, VRM currents, throttle status
5. Trip limits can be used
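A downstream consumer of the ibm-crassd path would subscribe to the pushed sensor stream. The sketch below is not taken from ibm-crassd's documentation: the endpoint URL and the JSON-keyed-by-sensor-name message layout are assumptions chosen only to illustrate the push model and the "changed sensors only" behavior.

```python
# Minimal sketch of consuming a crassd-style web-socket sensor push.
# The endpoint and JSON layout are assumptions for illustration only;
# consult ibm-crassd for the real subscription protocol and schema.
import asyncio
import json

import websockets   # pip install websockets

async def consume(url="ws://servicenode.example.com:8080/sensors"):
    async with websockets.connect(url) as ws:
        async for message in ws:
            update = json.loads(message)
            # Only sensors whose values changed are pushed,
            # so each message is expected to be a sparse update.
            for sensor, value in update.items():
                print(sensor, value)

asyncio.run(consume())
```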
[Diagram: storage node hosting the Big Data Store (BDS); users such as a pump controller, Prometheus, etc.]
Aggregation:
1. ibm-crassd plans to aggregate data into 1-hour max, min, and average values
2. ibm-crassd sends these aggregated values to the BDS
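The planned 1-hour aggregation amounts to keeping running min/max/average per sensor and flushing once per hour to the BDS. The sketch below shows that bookkeeping under an assumed layout; it is not crassd code.

```python
# Minimal sketch of per-sensor 1-hour min/max/average aggregation
# before forwarding to the Big Data Store. Not crassd code.
from collections import defaultdict

class HourlyAggregator:
    def __init__(self):
        self.samples = defaultdict(list)

    def add(self, sensor, value):
        self.samples[sensor].append(value)

    def flush(self):
        # Called once per hour; returns one summary record per sensor.
        summary = {s: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
                   for s, v in self.samples.items() if v}
        self.samples.clear()
        return summary

agg = HourlyAggregator()
agg.add("cpu0_temp", 61.0)
agg.add("cpu0_temp", 64.5)
print(agg.flush())   # {'cpu0_temp': {'min': 61.0, 'max': 64.5, 'avg': 62.75}}
```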
Users collect whatever they need for their specialized purposes; the data is hosted on a socket.
PowerAPI inclusion

[Diagram: compute node (CPU/OCC, GPUs, main memory, BMC, host OS, power measurement hardware) feeding a Power API server and a power/thermal database]

Various OOB data collection methods:
– IPMI – polled
– REST/Redfish – polled
– AMESTER – polled
– crassd – pushed via web socket

In-band collection: standard Linux lm_sensors.

Proposed converters/plug-ins (see the sketch below):
– lm_sensors to PowerAPI converter/plug-in
– Redfish, IPMI, etc. to PowerAPI converter/plug-ins

Power API server: hosted on the service node, JSRM, or ?, backed by a power/thermal database.

Consumers: user console, BDS, or sys admin; wherever the data is going to be consumed for telemetry or for taking cluster-level actions.
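The converter/plug-in idea is essentially a thin adapter that maps each lm_sensors (hwmon) reading onto a Power API object/attribute pair. The sketch below is a speculative illustration: the PWR_ATTR_* names echo the Power API specification's attribute naming, but this is not the real C API, and publish_to_powerapi_server() is a placeholder for the server interface this slide deliberately leaves open.

```python
# Sketch of an lm_sensors (hwmon) -> PowerAPI converter plug-in.
# Assumptions: publish_to_powerapi_server() is a placeholder for the
# still-undefined Power API server interface; the PWR_ATTR_* names echo
# the Power API specification but this is not the real C API.
import glob
import os
import time

def attr_and_value(label, raw):
    if label.startswith("power"):
        return "PWR_ATTR_POWER", raw / 1e6      # microwatts -> watts
    if label.startswith("temp"):
        return "PWR_ATTR_TEMP", raw / 1000      # millidegrees -> degrees C
    return None, None

def publish_to_powerapi_server(obj_name, attr, value, timestamp):
    # Placeholder: forward one (object, attribute, value, timestamp) tuple
    # to the Power API server / power-thermal database.
    print(f"{timestamp:.3f} {obj_name} {attr} = {value:.2f}")

def convert_once():
    for hwmon in glob.glob("/sys/class/hwmon/hwmon*"):
        with open(os.path.join(hwmon, "name")) as f:
            chip = f.read().strip()
        for path in glob.glob(os.path.join(hwmon, "*_input")):
            label = os.path.basename(path)
            with open(path) as f:
                raw = int(f.read().strip())
            attr, value = attr_and_value(label, raw)
            if attr is not None:
                publish_to_powerapi_server(f"node0.{chip}.{label}", attr,
                                           value, time.time())

convert_once()
```

A Redfish or IPMI plug-in would follow the same shape, with the polled OOB readings shown earlier taking the place of the hwmon reads.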
HPC PowerStack High Level Flow

[Diagram: HPC PowerStack layers – Site Admin / Facility Managers (State Manager); Site/Facility-level Tools (Cluster Manager); Facility/Mechanical System (power-gen, chiller, …); Job Scheduler / Resource Manager (SLURM, PBSPro); Application Manager (GEOPM); Platform Manager / System Components, e.g. GEOPM PlatformIO, portable solution: Power API (Variorum, PowerAPI, PAPI, HDEEM), low-level interfaces via ACPI, OPAL, IPMI, msr-safe; Job DB; in-band and out-of-band paths; working groups WG1, WG2, WG3]
[Diagram: EAS / State Manager – sub-modules #0, #1, … (could be hierarchical); app profile (min/max/avg. freq. impact); system characterization (PowerAPI); configuration via Power API; cluster DB; contract / procurement document; policies (for state n); operator override; current sub-state and state-change actions; STATES: 1. Normal Operation, 2. Benchmarking, 3. Maintenance, 4. Emergency, …]
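One way to read the state-manager figure: the EAS holds a small set of cluster states, each with its own policy, and a state change triggers a reconfiguration action pushed through the Power API. The sketch below is a speculative illustration of that flow; only the state names come from the figure, and the policy values and the configuration call are assumptions.

```python
# Speculative sketch of an EAS-style state manager: each cluster state
# carries a policy, and a state change triggers a (placeholder)
# Power API configuration action. Only the state names are from the figure.
from enum import Enum

class State(Enum):
    NORMAL_OPERATION = 1
    BENCHMARKING = 2
    MAINTENANCE = 3
    EMERGENCY = 4

# Per-state policies (illustrative values, not from the proposal).
POLICIES = {
    State.NORMAL_OPERATION: {"node_power_cap_w": 2000},
    State.BENCHMARKING:     {"node_power_cap_w": None},   # uncapped
    State.MAINTENANCE:      {"node_power_cap_w": 500},
    State.EMERGENCY:        {"node_power_cap_w": 300},
}

def apply_config_via_powerapi(policy):
    # Placeholder for the "Config. (Power API)" action in the figure.
    print("applying policy:", policy)

class StateManager:
    def __init__(self):
        self.current = State.NORMAL_OPERATION

    def change_state(self, new_state):
        self.current = new_state
        apply_config_via_powerapi(POLICIES[new_state])

mgr = StateManager()
mgr.change_state(State.EMERGENCY)
```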