Combine Out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER


Page 1

Andrea Bartolini, Prof., University of Bologna – DEI, Italy

Join the Conversation #OpenPOWERSummit

Combine Out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER

Page 2

Outline

• Datacenter Automation

• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring

• AI-based Anomaly Detection

• Future Works

Page 3

A New Trend: Datacentre Automation

[Diagram labels: Performance Analysis; Scalable Monitoring Framework; Machine Learning; Data Visualization; Resources Management; Energy efficiency; Job Scheduling; Heterogeneous Sensors; Common Interface; CRAC; PDU; CLUSTER; ENV.; Reactive and Proactive Feedbacks]

Page 4

Usage Scenarios

Fine Grain Power and Performance Measurements:

- Verify and classify node performance

- In spec / out of spec behavior

- Aging and wear out

- Predictive maintenance

- Per-user energy / performance accounting

[Diagram labels: Coarse grain; Fine grain; Node with CPU, ACC and DIMM components; Job Scheduler with req and util signals]

System Power Capping

- New Installations, Grid SLA, Power Shortage, Natural Disasters

- Ensures operating power below a maximum power consumption level

Page 5

A STAR IS BORN

D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe)

Page 6

D.A.V.I.D.E. Supercomputer

• Power architecture & NVIDIA Tesla Pascal GPUs

• 45 nodes

• Best-in-class components plus custom HW

• Innovative middleware SW system

• Peak performance: 990 TFlops

• #440 in Top500 and #18 in Green500 (November 2017)

• Power consumption: ≤ 2 kW per node

• Direct hot-water (27°C) liquid cooling

• Each rack has an independent heat exchanger with redundant pumps

Page 7

D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe)

[Figure labels: OCP form-factor compute node based on IBM Minsky; 2 x POWER8 with NVLink; 4x Tesla P100 SXM2; 2x IB EDR; liquid cooling; BusBar; University of Bologna fine grain power and performance monitoring]

Page 8

Out-of-band Power Monitoring

• Power measuring block, placed between the power supply unit (PSU) and the DC-DC converters, provides the overall node power consumption

• Embedded system (BeagleBoneBlack) that can support edge computing/learning

• Scalable interface to the data analysis point through the MQTT protocol

Architecture

Page 9

Out-of-band Power Monitoring

• Platform independent (Intel, IBM, ARM)

• Sub-Watt precision

• Sampling rate @50 kS/s (T = 20 µs)

State-of-the-art systems (Bull-HDEEM and PowerInsight):

• Sampling period no shorter than 1 ms

• Use data only offline (no possibility for real-time computing)

Architecture

Page 10

Out-of-band Power Monitoring

[Figure labels: Beaglebone Black connectors on PDB; Beaglebone Black]

• Directly integrated into the Power Distribution Board (PDB)

• Out-of-band power monitoring with a sampling rate of up to 50 kS/s per channel (800 kS/s per channel, decimated in HW to 50 kS/s)

• 12-bit 8-channel SAR ADC

• PTP HW enabled: samples can be synchronized to within a few microseconds
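Going from 800 kS/s to 50 kS/s per channel corresponds to a decimation factor of 16. The slides do not say which filter the hardware applies, so the sketch below assumes plain block averaging; the function name and the synthetic readings are illustrative only.

```python
def decimate_by_averaging(samples, factor):
    """Average each consecutive block of `factor` samples into one output sample.

    Assumption: a simple boxcar (block-mean) filter. The actual hardware
    filter on the board is not described in the slides.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    usable = len(samples) - len(samples) % factor  # drop an incomplete trailing block
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


RAW_RATE = 800_000     # samples per second per channel, from the slide
OUTPUT_RATE = 50_000   # samples per second after decimation, from the slide
FACTOR = RAW_RATE // OUTPUT_RATE

raw = [float(i) for i in range(32)]  # two blocks of synthetic ADC readings
decimated = decimate_by_averaging(raw, FACTOR)
print(FACTOR, decimated)
```

Averaging also reduces uncorrelated sample noise, which is one plausible reason to oversample and decimate rather than sample at 50 kS/s directly.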

Page 11

Back-end: BBB control software

[Diagram labels: BBB_node_pub (MQTT publisher); Sampling Thread; BBB ADC; MQTT pub]

• Interface with the internal BBB ADC subsystem

• Read voltages and currents @800 kS/s

• Data @50 kS/s readable directly from the BBB

• Interface with OpenPower:

  • BMC

  • OCC

  • APSS

• Publish data to MQTT brokers on topics characterized by sampling rate:

  • (V,I) @1 s sampling rate

  • (V,I) @1 ms sampling rate
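One way to derive the two topic rates from the 50 kS/s stream is to average voltage and current over each reporting period. The following is a minimal sketch under stated assumptions: the topic layout, the "V;I;timestamp" payload and the function names are hypothetical, and no broker connection is made.

```python
def window_means(samples, window):
    """Mean of each full window of `window` consecutive samples."""
    usable = len(samples) - len(samples) % window
    return [sum(samples[i:i + window]) / window for i in range(0, usable, window)]


def build_messages(node, voltages, currents, sample_rate_hz, period_s, t0=0.0):
    """Return (topic, payload) pairs carrying mean V and I per reporting period.

    The topic name and payload format are hypothetical; the slides only say
    that topics are characterized by sampling rate.
    """
    window = round(sample_rate_hz * period_s)
    label = "1s" if period_s >= 1 else f"{round(period_s * 1000)}ms"
    topic = f"{node}/power/{label}"  # hypothetical topic layout
    v_means = window_means(voltages, window)
    i_means = window_means(currents, window)
    return [
        (topic, f"{v:.3f};{i:.3f};{t0 + (k + 1) * period_s:.3f}")
        for k, (v, i) in enumerate(zip(v_means, i_means))
    ]


# 2 ms of synthetic samples at 50 kS/s: constant 12 V, current stepping 10 A -> 20 A
voltages = [12.0] * 100
currents = [10.0] * 50 + [20.0] * 50
messages = build_messages("davide1", voltages, currents, 50_000, 0.001)
for topic, payload in messages:
    print(topic, payload)
```

In a real publisher each pair would be handed to an MQTT client library; that step is left out so the sketch stays self-contained.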

[Diagram label: APSS, OCC, BMC sensors]

Page 12

Measurements Synchronization

Considering a sampling period of 20 µs [1]:

• NTP allows measurement synchronization between nodes with an uncertainty of a few samples (99% of the samples within 35 µs), which is sufficient for the @1s and @1ms topics

• PTP allows measurement synchronization with a time uncertainty smaller than the inter-sample time (a few µs of uncertainty).
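The comparison can be made concrete by expressing the NTP uncertainty in units of the reporting period. This is a small worked example using only the 20 µs and 35 µs figures from the slide; the helper name is illustrative.

```python
NTP_UNCERTAINTY_US = 35.0  # 99% of the samples, from the slide


def uncertainty_in_periods(uncertainty_us, period_us):
    """Express a timing uncertainty as a fraction of the sampling period."""
    return uncertainty_us / period_us


periods_us = {
    "raw samples (20 us)": 20.0,
    "@1ms topic": 1_000.0,
    "@1s topic": 1_000_000.0,
}
for label, period_us in periods_us.items():
    ratio = uncertainty_in_periods(NTP_UNCERTAINTY_US, period_us)
    print(f"{label}: NTP uncertainty = {ratio:g} periods")
```

At the raw rate the NTP uncertainty spans more than one sample, while for the 1 ms and 1 s topics it is a small fraction of a period, which is why PTP is only needed for aligning raw samples.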

[Plot labels: NTP, PTP]

[1] Evaluation of Synchronization Protocols for fine-grain HPC sensor data time-stamping and Collection, Libri et al., HPCS 2016

Page 13

[Plot labels: Application 1, Application 2]

How to do that in real time for a football-field-sized cluster of computing nodes?

Real-time frequency analysis on the power supply and more… a live oscilloscope

• For instance, using the FFT we plot the power spectral density of the power benchmark of two applications, and we can distinguish them by the harmonics present in each of the signals
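The idea of telling applications apart by their harmonics can be sketched with a periodogram of two synthetic power traces. Everything below is an illustration: the traces are made up, the harmonic frequencies are arbitrary, and a naive DFT replaces the FFT to keep the example dependency-free.

```python
import cmath
import math


def periodogram(signal, sample_rate_hz):
    """Return (frequencies, power) for the non-negative frequency bins.

    Uses a naive O(N^2) DFT instead of an FFT; power is the unnormalized
    |X[k]|^2 / N of the mean-removed signal.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        freqs.append(k * sample_rate_hz / n)
        power.append(abs(acc) ** 2 / n)
    return freqs, power


def dominant_frequency(signal, sample_rate_hz):
    """Frequency of the strongest spectral bin."""
    freqs, power = periodogram(signal, sample_rate_hz)
    return freqs[max(range(len(power)), key=power.__getitem__)]


FS = 50_000  # Hz, the sampling rate quoted in the slides
N = 250      # 5 ms of samples, giving 200 Hz frequency resolution


def synthetic_trace(harmonic_hz, base_watts=1000.0, ripple_watts=50.0):
    """Hypothetical power trace: constant load plus one sinusoidal harmonic."""
    return [
        base_watts + ripple_watts * math.sin(2 * math.pi * harmonic_hz * t / FS)
        for t in range(N)
    ]


app1 = synthetic_trace(1_000)  # made-up signature of "Application 1"
app2 = synthetic_trace(5_000)  # made-up signature of "Application 2"
print(dominant_frequency(app1, FS), dominant_frequency(app2, FS))
```

A monitor limited to 1 kS/s can only resolve components up to 500 Hz (the Nyquist limit), so neither of these two synthetic signatures would be visible to it.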

Framework       Fs max [kHz]
E4 PPBB         50
HDEEM           1
PowerInsight    1

Low overhead, accurate monitoring

Page 14

Scalable Data Collection, Analytics

Back-end

• MQTT-enabled sensor collectors

Front-end

• MQTT Brokers

• Data Visualization

• NoSQL Storage

• Big Data Analytics

[Diagram labels: Target Facility; Sens_pub collectors; MQTT; MQTT Brokers (Broker1 … BrokerM); MQTT2Kairos; Kairosdb; NoSQL (Cassandra node1 … nodeM); Grafana; Apache Spark; Applications (Python, Matlab); ADMIN]

Page 15

MQTT: MQ Telemetry Transport

• Lightweight message queuing and transport protocol

• Developed by IBM and Eurotech

• Well suited to resource-constrained scenarios like M2M, WSN and IoT applications

• Basic features:

  • PubSub model

  • Async communication protocol (messages)

  • Low overhead packet (2 bytes header)

  • QoS (3 levels)

• Open source implementation:

  • https://mosquitto.org/
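The 2-byte figure refers to the MQTT fixed header of a small packet: one byte of packet type and flags, plus a remaining-length field that fits in a single byte when the rest of the packet is shorter than 128 bytes. The sketch below follows the MQTT 3.1.1 layout of a QoS 0 PUBLISH packet; the function names are illustrative and no client library is involved.

```python
def encode_remaining_length(n):
    """Encode the MQTT 'remaining length' field (7 bits per byte, MSB = continuation)."""
    if not 0 <= n <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        byte = n % 128
        n //= 128
        if n > 0:
            byte |= 0x80  # more length bytes follow
        out.append(byte)
        if n == 0:
            return bytes(out)


def encode_publish_qos0(topic, payload):
    """Build a QoS 0 PUBLISH packet (no DUP, no RETAIN, no packet identifier)."""
    topic_bytes = topic.encode("utf-8")
    variable_header = len(topic_bytes).to_bytes(2, "big") + topic_bytes
    body = variable_header + payload
    fixed_header = bytes([0x30]) + encode_remaining_length(len(body))  # type 3 = PUBLISH
    return fixed_header + body


packet = encode_publish_qos0("a/b", b"1")
print(packet.hex(" "))
```

For this message the fixed header is two bytes and the whole packet is eight, so the protocol overhead stays small even for single-value sensor readings.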

[Diagram labels: Publisher (mosquitto_pub); Topic on the Broker (mosquitto); Subscriber (mosquitto_sub)]

Page 16

MQTT to NoSQL Storage: MQTT2Kairosdb

[Diagram: MQTT publishers Sens_pub_A, Sens_pub_B and Sens_pub_C publish {Value;Timestamp} payloads on topics such as facility/sensors/B to the MQTT broker. MQTT2Kairosdb subscribes to facility/sensors/# and stores each sensor in the Cassandra column family as a metric (A, B, C) with tags facility and sensors.]
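The bridge described on this slide can be sketched in two steps: match incoming topics against the facility/sensors/# subscription, then turn a {Value;Timestamp} payload into a data point. This is not the actual MQTT2Kairosdb code. The tag keys (level0, level1), the assumption that timestamps arrive in seconds, and the function names are all hypothetical; the output dictionary is shaped like the JSON objects KairosDB accepts for ingestion.

```python
def topic_matches(topic_filter, topic):
    """MQTT topic filter matching with '+' (one level) and '#' (all remaining levels).

    Special handling of topics starting with '$' is omitted.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def to_datapoint(topic, payload):
    """Map an MQTT message to a KairosDB-style data point dictionary.

    Assumptions (not from the slides): the last topic level is the metric
    name, earlier levels become tags keyed level0, level1, ..., and the
    payload timestamp is in seconds (converted to milliseconds here).
    """
    *prefix, metric = topic.split("/")
    value_text, timestamp_text = payload.split(";")
    return {
        "name": metric,
        "timestamp": int(float(timestamp_text) * 1000),
        "value": float(value_text),
        "tags": {f"level{i}": level for i, level in enumerate(prefix)},
    }


SUBSCRIPTION = "facility/sensors/#"
incoming = [("facility/sensors/B", "42.5;1521800000"), ("other/topic", "1;2")]
stored = [to_datapoint(t, p) for t, p in incoming if topic_matches(SUBSCRIPTION, t)]
print(stored)
```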

Page 17

Examon Analytics: Batch & Streaming

• Batch: examon-client (REST), Pandas dataframe

• Streaming: Bahir-mqtt (Spark connector)

Page 18

Front-end: OpenStack infrastructure

• The front-end can be deployed on the service node using the OpenStack framework

• VMs run services:

  • Proxy

  • Brokers

  • Cassandra

  • Apache Spark

• All services are horizontally scalable

[Diagram labels: OpenStack (Service Node) with VMs Proxy, Grafana, Broker, Kairosdb, Spark00 (Spark), and Cass00, Cass01, Cass03 (Cassandra, each with a 256Gb volume); CLUSTER with nodes, each with a BBB running Node_pub; MQTT between the cluster and the service node]

Page 19

D.A.V.I.D.E. configuration

Compute Nodes

• 45 compute nodes

• Hosts: davide1..davide45

Frontend

• Host: davidefe01

• Docker containers

• ~45K points/s

[Diagram labels: Users; Submit jobs; Applications; ADMIN; Grafana; MQTT; Broker0, Broker1, Broker2, Broker3; IPMI, OCC davide1-45; BBB power davide1-15, davide16-30, davide31-45; ipmi_pub; liteon_pub; asetek_pub; MQTT2Kairos; KairosDB node0; Cassandra node0; Job_collector (Aggregators); Jobs Node; SLURM Extended; get_job_energy (Get energy)]
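A per-job energy figure such as the one behind the get_job_energy label can be obtained by integrating each node's power samples over the job's run time and summing across its nodes. The sketch below only borrows the name from the diagram: the job record layout, the data structures and the trapezoidal rule are assumptions, not the actual SLURM extension.

```python
def energy_joules(samples, start, end):
    """Trapezoidal integral of power over [start, end].

    `samples` is a time-sorted list of (timestamp_s, power_w) pairs. Only
    samples inside the interval are used; edges are not interpolated.
    """
    inside = [(t, p) for t, p in samples if start <= t <= end]
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(inside, inside[1:])
    )


def get_job_energy(job, power_by_node):
    """Sum the energy of every node allocated to the job (hypothetical job record)."""
    return sum(
        energy_joules(power_by_node[node], job["start"], job["end"])
        for node in job["nodes"]
    )


# Synthetic 1 s power samples for two nodes over 20 s
power_by_node = {
    "davide1": [(t, 1000.0) for t in range(21)],
    "davide2": [(t, 500.0) for t in range(21)],
}
job = {"nodes": ["davide1", "davide2"], "start": 5, "end": 15}
print(get_job_energy(job, power_by_node), "J")
```

With finer-grained samples the same integration captures short power spikes that 1 s averages would smooth out.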

Page 20

AI+Big Data on D.A.V.I.D.E.: Anomaly detection

Embedded Board: Monitoring + Anomaly Detection

1. The monitoring infrastructure collects data from the embedded boards of computing nodes 1 to N

2. A DL model is trained on the historical data

3. The trained model is loaded into the boards

4. Online anomaly detection runs on live, new data, separating normal behaviour from anomalies

Page 21

AI+Big Data on D.A.V.I.D.E.: Anomaly detection

[Plot annotation: Fault!!]

Autoencoder-based anomaly detection
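The principle behind autoencoder-based detection is that a model trained only on normal behaviour reconstructs normal samples well and anomalous ones poorly. The sketch below is a stand-in, not the deep-learning model from the slides: it uses a one-unit linear autoencoder whose weights are set in closed form to the first principal direction, and the features (utilization, node power in kW), the data and the threshold rule are all hypothetical.

```python
import math


class LinearAutoencoder:
    """One-unit linear autoencoder for 2-D points with tied weights.

    The weights are set in closed form to the first principal direction,
    which is the optimum a linear autoencoder converges to.
    """

    def fit(self, points):
        n = len(points)
        self.mean = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
        sxx = sum((x - self.mean[0]) ** 2 for x, _ in points) / n
        syy = sum((y - self.mean[1]) ** 2 for _, y in points) / n
        sxy = sum((x - self.mean[0]) * (y - self.mean[1]) for x, y in points) / n
        theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # principal-axis angle
        self.weights = (math.cos(theta), math.sin(theta))
        # Assumed rule: three times the worst reconstruction error seen in training
        self.threshold = 3.0 * max(self.reconstruction_error(p) for p in points)
        return self

    def encode(self, point):
        return ((point[0] - self.mean[0]) * self.weights[0]
                + (point[1] - self.mean[1]) * self.weights[1])

    def decode(self, code):
        return (self.mean[0] + code * self.weights[0],
                self.mean[1] + code * self.weights[1])

    def reconstruction_error(self, point):
        rx, ry = self.decode(self.encode(point))
        return math.hypot(point[0] - rx, point[1] - ry)

    def is_anomaly(self, point):
        return self.reconstruction_error(point) > self.threshold


# Synthetic "normal" history: power grows linearly with utilization, small ripple
normal = [
    (i / 20, 0.3 + 1.5 * (i / 20) + (0.01 if i % 2 == 0 else -0.01))
    for i in range(21)
]
model = LinearAutoencoder().fit(normal)
print(model.is_anomaly((0.5, 1.05)))  # consistent with the training data
print(model.is_anomaly((0.2, 1.6)))   # high power at low utilization
```

A deep autoencoder follows the same train, reconstruct and threshold pattern, but can capture nonlinear relations across many more sensor channels.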

Page 22

Conclusion & Future Works

• We presented an approach that combines out-of-band monitoring, big data and AI to enable Datacenter Automation

• We demonstrated the effectiveness of our approach in enabling automated anomaly detection on computing nodes

• Future Works:

  • Extending the approach toward security and house-keeping tasks in datacenters

  • Exploit OpenBMC and OpenHW to lower the cost of the approach

  • Looking for partnerships to bring it to P9 systems

Page 23

ACKNOWLEDGEMENTS

The Datacenter Automation TEAM

• Luca Benini, Michela Milano, Andrea Borghesi, Antonio Libri, Francesco Beneventi, Alessandro Petrella