Andrea Bartolini, Prof.
University of Bologna – DEI, Italy
Join the Conversation #OpenPOWERSummit
Combine Out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER
Outline
• Datacenter Automation
• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring
• AI-based Anomaly Detection
• Future Works
[Figure: datacenter automation loop. Labels: Performance Analysis; Scalable Monitoring Framework; Machine Learning; Data Visualization; Resources Management; Energy Efficiency; Job Scheduling; Heterogeneous Sensors; Common Interface; CRAC; PDU; CLUSTER; ENV.; Reactive and Proactive Feedbacks]
A New Trend: Datacenter Automation
Usage Scenarios
Fine Grain Power and Performance Measurements:
- Verify and classify node performance
- In spec / out of spec behavior
- Aging and wear out
- Predictive maintenance
- Per-user energy / performance accounting
[Figure: monitoring granularity, from coarse grain to fine grain. Labels: Job Scheduler (req, req, util); Node; CPU, CPU; ACC, ACC; DIMM, DIMM, DIMM]
System Power Capping
- New Installations, Grid SLA, Power Shortage, Natural Disasters
- Ensures operating power below a maximum power consumption level
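As a rough illustration of the capping idea, the sketch below scales per-node power budgets proportionally whenever their sum exceeds a system-wide cap. The proportional policy, the function name and the node names are assumptions made for this example; the slides do not state which capping policy is used.

```python
def apply_power_cap(requested_w, cap_w):
    """Return per-node budgets whose sum does not exceed cap_w.

    Hypothetical policy: if the total request is above the cap, every
    node budget is scaled by the same factor; otherwise it is unchanged.
    """
    total = sum(requested_w.values())
    if total <= cap_w:
        return dict(requested_w)
    scale = cap_w / total
    return {node: watts * scale for node, watts in requested_w.items()}


# Hypothetical request of 5000 W against a 4000 W system cap.
budgets = apply_power_cap(
    {"node1": 1800.0, "node2": 2000.0, "node3": 1200.0}, cap_w=4000.0
)
```

A real deployment would more likely combine such a budget with the job scheduler, so that capping decisions take job priorities into account.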
A STAR IS BORN
D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe)
D.A.V.I.D.E. Supercomputer
• Power architecture & NVIDIA Tesla Pascal GPUs
• 45 nodes
• Best-in-class components plus custom HW
• Innovative middleware SW system
• Peak performance: 990 TFlops
• #440 in Top500 and #18 in Green500 (November 2017)
• Power consumption: <= 2 kW per node
• Direct hot-water (27°C) liquid cooling
• Each rack has an independent heat exchanger with redundant pumps
D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe)
OCP form-factor compute node based on IBM Minsky
[Figure: compute node. Labels: 2x IB EDR; liquid cooling; 4x Tesla P100 HSMX2; 2x POWER8 with NVLink; BusBar; fine-grain power and performance monitoring (University of Bologna)]
Out-of-band Power Monitoring
• Power measuring block, placed between the power supply unit (PSU) and the DC-DC converters, provides the overall node power consumption
• Embedded system (BeagleBone Black) that can support edge computing/learning
• Scalable interface to the data analysis point through the MQTT protocol
Architecture
Out-of-band Power Monitoring
• Platform independent (Intel, IBM, ARM)
• Sub-Watt precision
• Sampling rate @50kS/s (T=20us)
State-of-the-art systems (Bull-HDEEM and PowerInsight)
• Max. 1 ms sampling period
• Data used only offline (no possibility for real-time computing)
Architecture
Out-of-band Power Monitoring
[Figure: BeagleBone Black and the BeagleBone Black connectors on the PDB]
• Directly integrated into the Power Distribution Board (PDB)
• Out-of-band power monitoring with sampling rate up to 50kS/s per channel (sampled at 800kS/s per channel, decimated in HW to 50kS/s)
• 12-bit 8-channel SAR ADC
• PTP hardware support: samples can be synchronized to within a few microseconds
[Figure: BBB_node_pub (MQTT publisher). Labels: BBB ADC; Sampling Thread; MQTT pub]
Back-end: BBB control software
• Interface with the internal BBB ADC subsystem
• Read voltages and currents @800kS/s
• Data @50kS/s readable directly from the BBB
• Interface with OpenPOWER:
  • BMC
  • OCC
  • APSS
• Publish data to MQTT brokers on topics characterized by sampling rate:
  • (V,I) @ 1 s sampling period
  • (V,I) @ 1 ms sampling period
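The rate reduction from the raw ADC stream to the published topics can be sketched as repeated block averaging. This is only an illustration: the slides say the 800kS/s to 50kS/s step is done in hardware but not how, so the boxcar (mean) decimator, the function name and the synthetic data below are assumptions.

```python
def decimate_mean(samples, factor):
    """Average consecutive groups of `factor` samples (boxcar decimator).

    Trailing samples that do not fill a whole group are dropped.
    """
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


RAW_RATE = 800_000   # samples per second read from the ADC
NODE_RATE = 50_000   # samples per second available on the BBB

# 10 ms of synthetic power samples in watts (hypothetical values).
raw = [100.0 + (i % 16) for i in range(RAW_RATE // 100)]

at_50k = decimate_mean(raw, RAW_RATE // NODE_RATE)   # factor 16
at_1ms = decimate_mean(at_50k, NODE_RATE // 1000)    # factor 50
```

The same helper applied once more with a factor of 1000 would turn the 1 ms stream into the 1 s stream.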
APSS, OCC, BMC sensors
Measurements Synchronization
Considering a sampling period of 20 us [1]:
• NTP allows measurement synchronization between nodes with an uncertainty of a few samples (99% of the samples within 35 us), which is perfect for the @1s and @1ms topics
• PTP allows measurement synchronization with a time uncertainty smaller than the inter-sample time (a few us of uncertainty).
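To make the comparison concrete, the short calculation below expresses the 35 us NTP uncertainty quoted above as a fraction of each sampling period. The helper name is made up for this example; the numbers come from the slide.

```python
NTP_UNCERTAINTY_US = 35.0  # 99% of the samples fall within this bound


def uncertainty_in_periods(uncertainty_us, period_us):
    """Express a timing uncertainty as a multiple of the sampling period."""
    return uncertainty_us / period_us


ntp_at_50k = uncertainty_in_periods(NTP_UNCERTAINTY_US, 20.0)         # T = 20 us
ntp_at_1ms = uncertainty_in_periods(NTP_UNCERTAINTY_US, 1_000.0)      # T = 1 ms
ntp_at_1s = uncertainty_in_periods(NTP_UNCERTAINTY_US, 1_000_000.0)   # T = 1 s
```

At 50kS/s the NTP uncertainty spans 1.75 sampling periods, while for the 1 ms and 1 s topics it is a small fraction of one period, which is why PTP is only needed for the full-rate stream.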
[Figure panels: NTP, PTP]
[1] Libri et al., "Evaluation of Synchronization Protocols for Fine-grain HPC Sensor Data Time-stamping and Collection", HPCS 2016
[Figure panels: Application 1, Application 2]
How to do that in real-time for a football-sized cluster of computing nodes?
Real-time frequency analysis on power supply and more… a live oscilloscope
• For instance, using the FFT we plot the power spectral density of the power traces of two application benchmarks, and we can distinguish the applications by the harmonics present in each signal
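A minimal sketch of this kind of analysis follows. It uses a small pure-Python radix-2 FFT and two synthetic power traces; the trace shapes, amplitudes and harmonic frequencies are invented for the example and do not come from the measurements in the talk. The spectrum is a plain periodogram, so values are proportional to power spectral density only up to a scaling constant.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(signal, fs):
    """One-sided periodogram of the signal with its mean removed."""
    n = len(signal)
    mean = sum(signal) / n
    spec = fft([s - mean for s in signal])
    freqs = [k * fs / n for k in range(n // 2)]
    power = [abs(spec[k]) ** 2 / n for k in range(n // 2)]
    return freqs, power


def dominant_frequency(signal, fs):
    """Frequency of the strongest spectral line."""
    freqs, power = power_spectrum(signal, fs)
    return freqs[max(range(len(power)), key=power.__getitem__)]


FS = 50_000  # sampling rate of the out-of-band monitor, in Hz
N = 1024     # samples per analysis window (about 20 ms)


def synthetic_trace(bin_index, amplitude=20.0, base=300.0):
    """Hypothetical power trace: constant load plus one harmonic."""
    f = bin_index * FS / N
    return [base + amplitude * math.sin(2 * math.pi * f * i / FS)
            for i in range(N)]


app1 = synthetic_trace(20)    # harmonic at about 977 Hz
app2 = synthetic_trace(100)   # harmonic at about 4883 Hz
f1 = dominant_frequency(app1, FS)
f2 = dominant_frequency(app2, FS)
```

The second harmonic lies well above 500 Hz, the highest frequency a monitor limited to a 1 ms sampling period could resolve, so it is only visible at the 50kS/s rate.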
Framework       Fs,max [kHz]
E4 PPBB         50
HDEEM           1
PowerInsight    1
Low overhead, accurate monitoring
Scalable Data Collection, Analytics
[Figure: data collection architecture. Labels: Sens_pub publishers; MQTT; Broker1 .. BrokerM (MQTT Brokers); MQTT2Kairos; KairosDB; Cassandra node1 .. nodeM (NoSQL); Grafana; Apache Spark; Python; Matlab; Target Facility; Applications; ADMIN]
Back-end
• MQTT-enabled sensor collectors
Front-end
• MQTT Brokers
• Data Visualization
• NoSQL Storage
• Big Data Analytics
MQTT: MQ Telemetry Transport
• Lightweight message queuing and transport protocol
• Developed by IBM and Eurotech
• Well suited for low-resource scenarios like M2M, WSN and IoT applications
• Basic features:
  • PubSub model
  • Async communication protocol (messages)
  • Low overhead packet (2 bytes header)
  • QoS (3 levels)
• Open source implementation:
  • https://mosquitto.org/
[Figure: Publisher (mosquitto_pub) -> Topic (Broker, mosquitto) -> Subscriber (mosquitto_sub)]
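The 2-byte header mentioned above is the MQTT fixed header: one byte for packet type and flags, followed by a variable-length "remaining length" field that takes a single byte for packets up to 127 bytes. The sketch below builds that header for a PUBLISH packet following the MQTT 3.1.1 layout; the function names, topic and payload are chosen for this example.

```python
def encode_remaining_length(length):
    """Encode the MQTT variable-length 'remaining length' field (1-4 bytes)."""
    if not 0 <= length <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        byte = length % 128
        length //= 128
        if length > 0:
            byte |= 0x80  # continuation bit: more length bytes follow
        out.append(byte)
        if length == 0:
            return bytes(out)


def publish_fixed_header(remaining_length, qos=0, retain=False, dup=False):
    """Fixed header of a PUBLISH packet (packet type 3 in the high nibble)."""
    first = (3 << 4) | (int(dup) << 3) | (qos << 1) | int(retain)
    return bytes([first]) + encode_remaining_length(remaining_length)


# Hypothetical sample published at QoS 0: the remaining length covers the
# 2-byte topic length prefix, the topic itself and the payload.
topic = b"facility/sensors/B"
payload = b"42.5;1520000000"
header = publish_fixed_header(2 + len(topic) + len(payload))
```

For this small message the header is exactly 2 bytes; it grows to 3 bytes only once the rest of the packet exceeds 127 bytes.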
MQTT to NoSQL Storage: MQTT2Kairosdb
[Figure: MQTT publishers Sens_pub_A, Sens_pub_B and Sens_pub_C publish on topics such as facility/sensors/B to the MQTT broker; MQTT2Kairosdb subscribes to facility/sensors/# and stores the data in a Cassandra column family]
Payload = {Value;Timestamp}
Metric: A, Tags: facility, sensors
Metric: B, Tags: facility, sensors
Metric: C, Tags: facility, sensors
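The topic-to-metric mapping can be sketched in a few lines: a subscriber checks incoming topics against the `facility/sensors/#` filter, takes the last topic level as the metric name, the preceding levels as tags, and splits the `Value;Timestamp` payload. This is an illustrative reading of the slide, not the actual MQTT2Kairosdb code; the function names and the output dictionary layout are assumptions, and the `$`-prefixed system topics of MQTT are ignored.

```python
def topic_matches(topic_filter, topic):
    """Check an MQTT topic against a filter with '+' and '#' wildcards."""
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":          # multi-level wildcard matches the rest
            return True
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)


def to_datapoint(topic, payload):
    """Map a topic and a 'Value;Timestamp' payload to a metric data point."""
    *tags, metric = topic.split("/")
    value, timestamp = payload.split(";")
    return {
        "metric": metric,
        "tags": tags,
        "value": float(value),
        "timestamp": float(timestamp),
    }


# Hypothetical message from Sens_pub_B.
point = to_datapoint("facility/sensors/B", "42.5;1520000000")
```

A store such as KairosDB expects tags as key/value pairs; the slide only shows the tag values, so the keys are left out here.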
Examon Analytics: Batch & Streaming
• Batch: examon-client (REST) -> Pandas dataframe
• Streaming: Bahir-mqtt (Spark connector)
Front-end: OpenStack infrastructure
• The Front-end can be deployed on the service node using the OpenStack framework
• VMs run services:
  • Proxy
  • Brokers
  • Cassandra
  • Apache Spark
• All services are horizontally scalable
[Figure: OpenStack service node and cluster. Service node labels: Cass00, Cass01, Cass03 (Cassandra); three volumes (256Gb each); KairosDB; Spark00 (Spark); Proxy; Grafana; Broker. Cluster labels: four nodes, each with BBB and Node_pub, connected over MQTT]
D.A.V.I.D.E. configuration
Compute Nodes
• 45 compute nodes
• Hosts: davide1..davide45
Frontend
• Host: davidefe01
• Docker containers
• ~45K points/s
[Figure: D.A.V.I.D.E. monitoring deployment. Labels: Broker0, Broker1, Broker2, Broker3 (MQTT); MQTT2Kairos; KairosDB node0; Cassandra node0; Grafana; ADMIN; Applications (submit jobs); IPMI, OCC (davide1-45); BBB power (davide1-15, davide16-30, davide31-45); ipmi_pub; liteon_pub; asetek_pub; Job_collector (aggregators: jobs, node); SLURM extended with get_job_energy (users get energy)]
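One plausible way to obtain a per-job energy figure from the stored power samples is to integrate node power over the job's time window. The sketch below does this with the trapezoidal rule. The slides name `get_job_energy` but do not describe how it is computed, so the function here, its signature and the sample data are assumptions.

```python
def job_energy_joules(samples, start, end):
    """Integrate (timestamp_s, power_w) samples over [start, end].

    Uses the trapezoidal rule on the samples that fall inside the job
    window; samples are assumed to be sorted by timestamp.
    """
    window = [(t, p) for t, p in samples if start <= t <= end]
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(window, window[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


# Hypothetical 1 s power samples alternating between 1000 W and 1100 W.
samples = [(float(t), 1000.0 + 100.0 * (t % 2)) for t in range(0, 11)]

# Energy of a job that ran from t = 2 s to t = 8 s on this node.
energy = job_energy_joules(samples, start=2.0, end=8.0)
```

For a multi-node job the same integral would be summed over all nodes allocated to the job.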
Embedded Board: Monitoring + Anomaly Detection
[Figure: computing nodes 1..N, each with an embedded board, connected to the monitoring infrastructure]
1. Collect data through the monitoring infrastructure
2. Train the DL model on historical data
3. Load the trained model in the boards
4. Online anomaly detection on live, new data (normal behaviour vs. anomaly)
AI+Big Data on D.A.V.I.D.E.: Anomaly detection
Fault!!
Autoencoder-based anomaly detection
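The principle can be shown with a toy example: an autoencoder trained only on normal data reconstructs normal samples well and anomalous samples badly, so a threshold on the reconstruction error separates the two. The sketch below uses a tiny linear autoencoder with tied weights and a one-dimensional bottleneck, trained by gradient descent on synthetic two-feature data. The model in the talk is a deep learning model; the architecture, data and threshold rule here are all assumptions chosen to keep the example self-contained.

```python
import math


def encode(w, x):
    """Project a 2-feature sample onto the 1-dimensional bottleneck."""
    return w[0] * x[0] + w[1] * x[1]


def reconstruction_error(w, x):
    """Euclidean distance between a sample and its reconstruction."""
    z = encode(w, x)
    return math.hypot(x[0] - w[0] * z, x[1] - w[1] * z)


def train(data, epochs=300, lr=0.05):
    """Gradient descent on the mean squared reconstruction error."""
    w = [0.5, 0.5]
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in data:
            z = encode(w, x)
            r = (x[0] - w[0] * z, x[1] - w[1] * z)   # residual
            wr = w[0] * r[0] + w[1] * r[1]
            grad[0] += -2.0 * (r[0] * z + wr * x[0]) / n
            grad[1] += -2.0 * (r[1] * z + wr * x[1]) / n
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w


# Synthetic "normal" behaviour: the second feature is about twice the
# first, with a small alternating perturbation (hypothetical data).
normal = []
for i in range(21):
    t = (i - 10) / 10
    s = 1 if i % 2 == 0 else -1
    normal.append((t + 0.02 * s, 2 * t - 0.01 * s))

weights = train(normal)

# Hypothetical rule: flag anything with more than 3x the worst error
# seen on the normal training data.
threshold = 3.0 * max(reconstruction_error(weights, x) for x in normal)


def is_anomaly(x):
    return reconstruction_error(weights, x) > threshold
```

With this setup `is_anomaly((0.5, 1.0))` is false because the point follows the learned relation, while `is_anomaly((1.0, -1.0))` is true. On an embedded board only `encode`, the reconstruction and the threshold comparison would need to run online; training happens offline on historical data.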
Conclusion & Future Works
• We presented an approach that combines out-of-band monitoring, big data and AI to enable datacenter automation
• We demonstrated the effectiveness of our approach in enabling automated anomaly detection on computing nodes
• Future Works:
  • Extend the approach toward security and housekeeping tasks in datacenters
  • Exploit OpenBMC and OpenHW to lower the cost of the approach
  • Look for partnerships to bring it to P9 systems
ACKNOWLEDGE
The Datacenter Automation TEAM
• Luca Benini, Michela Milano, Andrea Borghesi, Antonio Libri, Francesco Beneventi, Alessandro Petrella