PowerAPI usage proposal v1 · (SLURM, PBSPro) Application Manager (GEOPM) System Components WG 2 WG...

8
© 2017 IBM Corporation PowerAPI Proposal Todd Rosedahl, IBM Ryan Grant, SNL

Transcript of PowerAPI usage proposal v1 · (SLURM, PBSPro) Application Manager (GEOPM) System Components WG 2 WG...

© 2017 IBM Corporation

PowerAPI Proposal

Todd Rosedahl, IBM

Ryan Grant, SNL

OCCCPU

Node

IB pathway 1Tighter workload coupling

Performance concernsJitter concerns

Data AvailableDetailed Profiling Data such as

component power, temp, utilization

Telemetry Data – In Band vs Out of Band

OOB pathwayMeasure without affecting system performanceTime frame is more coarse (seconds vs ms)Difficult to correlate to system jobs sections

Main Memory

IPMI – PolledComponent Power/Temp/Fans

via sensorsNode Power/Inlet temp

REST/Redfish – PolledSimilar sensor data

AMESTER - PolledDetailed tracing from OCC

Crassd – Pushed via web socketComponent temps and powers

pushed every X secondsAlerts with Trip Limits

Host OS

Power Measurement hardware and

GPUsBMC

GPUsOCC

CPU

Host OS

Node

Measurement – In Band Sensors

BMC/FSP

Main Memory

1. 40+ different types of sensor readings• Power, performance, utilization

2. 400+ total sensors (24 cores scales this up)3. Timestamped with the system timestamp (will be used by GPU as well).4. Accumulator and update tag support for energy calculations5. Min/max support – and clearing of min/max6. 4KB pushed up every 10ms7. All sensors updated every 100ms8. Read by the o/s, CSM, job profiler9. In standard lm_sensor linux utility

© 2011 IBM Corporation4

AMESTER: Automated Measurement of Systems for Energy and Temperature Reporting – Out of Band

§ Sensor data collection– Whole system power measurement

Component power on POWER: Processor Vdd, Memory, IO, Storage, Fans

– Core: temperature, frequency, utilization, memory bandwidth, instructions per second.

– Histograms

§ 500 µs tracing into 8KB buffer– E.g.1 sensor for 1 second

§ 2 ms tracing into 8KB buffer– E.g. 1 sensor for 8 seconds

§ Amester is a Tcl interpreter– All functions of Amester can be scripted by

Tcl scripts– Data library to collect and graph user data– Output to CSV file for import into Excel or

Matlab

OCCCPU

Compute Nodes(300+)

Cluster RAS Service Daemon (crassd)

BMC

Main Memory

GPUs

Service Node

1. CSM collects sensor data every 10 min from the os using via a shared memory read

2. Read CPU, GPU, Memory Temps, throttled3. Read CPU, GPU Energy Consumption4. CSM puts sensor data into BDS

1. ibm-crassd uses web sockets2. BMC pushes sensors every 1s3. Only collect changed sensors

to reduce bandwidth4. CPU temps, Dimm temps, Fan

power, VRM currents, throttled5. Can use trip limits

Storage Node

Big Data Store

User

Pump controller,

Prometheus, etc.

1. Ibm-crassd plans to aggregate data into 1 hour max, min, average

2. Ibm-crassd sends these aggregated values to BDS

Ibm-crassdService CSM

User collects whatever it needs for its specialized purposes. Data is hosted on a socket

CSM

Host OS

Example Output from AC922 using ibm-crassdand Netdata

OCCCPU

Compute Node

PowerAPI inclusion

Main Memory

Various OOB Data Collection MethodsIPMI – Polled

REST/Redfish – PolledAMESTER - Polled

Crassd – Pushed via web socket

Host OS

Power Measurement hardware and

GPUs

BMC

In Band collectionStandard linux lm_sensor

User Console or BDS or Sys Admin. Wherever the data is going to be consumed for Telemetry or for taking cluster level actions.

Lm_sensor to PowerAPI

converter/plug inRedfish, IPMI, etc to PowerAPIconverter/plug ins

PowerAPI PowerAPI

Power/thermal database

Power API ServerService Node or JSRM or ?

Site Admin / Facility Managers

(State Manager)

WG1

Site/Facility-level Tools(Cluster Manager)

Facility/Mechanical System(Power-gen, chiller,…)

Job Scheduler Resource Manager

(SLURM, PBSPro)

Application Manager (GEOPM)

System Components

WG2

WG3

In-BandOut-of-Band

E.g. GEOPM PlatformIOPortable Solution: Power API

Platform Manager

ACPI, OPAL, IPMI, msr-safe

(Variorum, PowerAPI, PAPI, HDEEM)

Job DB

Sub Module #0(could be hierarchical)

App profile(min/max/avg.freq. impact)

Sub Module #1…

Statechange Action

Config. (Power API)

Current

Sub-state

-- STATES --1. Normal Operation2. Benchmarking3. Maintenance4. Emergency5. :

System Characterization

(PowerAPI)

Cluster DB

Config. (Power API)

Override

EAS

ContractProcurement document

Policies(for state n)

HPC PowerStack High Level Flow