
Data Centric Systems

David Turek

VP High Performance and Cognitive Computing

© 2016 International Business Machines Corporation

R&D in the IT Era

Theory/Knowledge

Experiment

Simulation

Massive improvements:

• Applicability & ease of use
• Simulation fidelity
• Scalability
• Throughput

thanks to Supercomputing

Source: Top500.org

Implementing Exact-Exchange in CPMD

>99% Parallel Efficiency to over 6.2M threads

Studying Li-Air batteries: 1,736 atoms, 70 Ry cutoff

V. Weber, T. Laino, C. Bekas, A. Curioni, A. Bertsch, S. Futral IPDPS 13


ACM Gordon Bell Prize 2013

14.4 PFLOP/s at 73% of peak performance, including I/O

Two orders of magnitude improvement in

• scale of the problem (from 128 to 15K bubbles)

• time to solution

Compute specifics: 13 trillion elements, 1.2 TB of compressed I/O per time step, 6.4M threads

IBM, ETHZ, TUM, LLNL

Success in Petascale computing: CFD can achieve Linpack-like sustained performance


ACM Gordon Bell Prize 2015

97% sustained scalability for a fully implicit solver: 1.6M cores, 3.2M MPI processes, 602B DoF

IBM, UT Austin, NYU, CALTECH

Success in Petascale computing: Implicit linear solvers do scale!


R&D Today

Diagram labels: Simulation; Experiments; Opportunistic Discovery by Humans; New Product

But we cannot beat complexity with brute-force simulation. Traditional discovery has limits: we need a new, data-driven, holistic approach.


Knowledge

Companies need easy access to quickly growing and widely diverse information sources.

• Highly unstructured/dark
• Current human-based approach is not scalable

Inference & Simulation

Domain-related inference is largely missing. Setting up and deploying the right simulations is very hard.

• Human-capital intensive, not scalable

Evidence & Experiments

Internal evidence and experiments are driven primarily empirically, often by brute force, and their results are isolated from the wider knowledge space.


Cognitive Discovery: drastically accelerate the pace of systematic discovery and maximize ROI for R&D

Create a technical-area-specific knowledge space from all relevant sources. Link it with company data.

Use the knowledge space to
• Drastically augment internal know-how & modeling
• Focus on which experiment is relevant
• Embed results in the knowledge base

Use inference on the knowledge space & simulation on the models to
• Augment the knowledge space
• Sharpen simulation models
• Make precise decisions

Rapid and Precise Materials R&D drives new value for our clients

Application areas: Pharma; Materials; Engineering & Manufacturing

Diagram labels: Science, Products & Economics; Simulations; Experimental Results; Knowledge; Inference & Simulation; Evidence & Experiments

Document Ingestion: PDF
Domain Specific Knowledge Graphs
Domain Specific ML + Inference
NLQ + ML Driven Simulations
Automatic Hypothesis Discovery
Fully Automated Reasoning
Fully Automated Discovery

Axis labels: Mature; Ideation

KNOWLEDGE EXTRACTION & REPRESENTATION
INFERENCE DRIVEN SIMULATIONS
AUTOMATED TECHNICAL REASONING


1. Literature review
Non-scalable, human-based outsourcing:
• Limited sources
• Non-systematic; limited re-use

2. Chemical/physical/engineering modeling & simulations
• Expert materials scientists
• Empirical: no inference
• Trial-and-error based: no systematic knowledge buildup

3. Lab tests
Costly in time and money:
• Empirical (slow: many tests)
• No systematic knowledge buildup & connection

Timeline: Months, Months, Months (Years in total)

INGESTION, SIMULATION, ANALYSIS

Design alloys to avoid catastrophic failure that can lead to huge liabilities
• Corrosion
• Cracks
• Special environmental and deployment conditions

Deep Search
• Scientific literature & internal reports
• Lab tests and experiment data

Knowledge space

Simulation & Inference
• Atomistic simulations
• Deep Learning based property prediction

Timeline labels: Weeks; DAYS

Pdf-parser:

• Parses the PDF code and presents the raw data of the PDF (text cells, embedded images and vector graphics) in a consumable format

Pdf-interpreter:

• Captures ground truth through a massive crowd-sourcing Big Data system

• Uses HPC for ML techniques (Deep Learning) to train automatic annotation models

Semantic-representation:

• Uses HPC & Big Data systems to obtain a semantic representation of the original text in JSON format

Billions of documents
Millions of concurrent users
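
The three stages above (parse, interpret, emit JSON) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `TextCell` schema, the rule-based `annotate` function (a stand-in for the trained annotation models the slide mentions) and the output layout are hypothetical, not the actual system.

```python
import json
from dataclasses import dataclass


@dataclass
class TextCell:
    """A raw text cell as a PDF parser might emit it (hypothetical schema)."""
    page: int
    bbox: tuple  # (x0, y0, x1, y1); assumes the origin is at the top-left
    text: str


def annotate(cell: TextCell) -> str:
    """Toy stand-in for a trained annotation model: label a cell by rules."""
    stripped = cell.text.strip()
    if stripped.startswith("•"):
        return "list_item"
    if stripped.endswith(":") or stripped.isupper():
        return "heading"
    return "paragraph"


def to_semantic_json(cells) -> str:
    """Order cells top-to-bottom per page and emit a labelled JSON document."""
    ordered = sorted(cells, key=lambda c: (c.page, c.bbox[1], c.bbox[0]))
    items = [
        {"type": annotate(c), "page": c.page, "text": c.text.strip()}
        for c in ordered
    ]
    return json.dumps({"items": items}, indent=2)


cells = [
    TextCell(1, (0, 50, 100, 60), "• Parses the pdf-code"),
    TextCell(1, (0, 10, 100, 20), "Pdf-parser:"),
]
print(to_semantic_json(cells))
```

In a real pipeline the rule-based labeller would be replaced by a model trained on the crowd-sourced ground truth; the sketch only shows the data flow from raw cells to a JSON representation.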


An Engineering Case Study: Current linear approach

1. Literature review & internal lab logs analysis
Limited scope: structured only

2. Independent empirical model buildup

3. Lab tests and experiments on independent testbeds
• Takes years to calibrate
• No direct correlation across testbeds

Timeline: Months, Years, Years (Years in total)

Chemistry, Mechanics, Electrics


An Engineering Case Study: Enriching knowledge space with simulation

Stages: Inference, Design, Simulation

Knowledge Graph
• Academic literature & information on internal combustion engines
• Links lab data with literature: fuel combustion + piston geometry + …

New cylinder/piston/injector geometries
• Use Knowledge Graph to quickly rule out non-viable design directions
• Augment missing information & perform validation with advanced CFD

Where to augment: Knowledge analytics
• Add a link if it significantly changes the quality of the knowledge graph
• Only specific and well-defined simulations need to be done

Diagram (Missing Link in Knowledge Graph): the valve geometry node and the piston geometry node have no link; a CFD simulation supplies a value, producing a link with weight.

WHAT WAS TAKING YEARS CAN NOW BE DELIVERED IN DAYS
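
A minimal sketch of the "where to augment" idea: rank missing links by how much they would improve a graph quality score, and simulate only those. The quality score used here (number of node pairs connected by some path) and the node names are assumptions chosen for illustration; the slides do not say which metric the actual knowledge analytics uses.

```python
from itertools import combinations


def components(nodes, edges):
    """Connected components of an undirected graph, via union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def connected_pairs(nodes, edges):
    """Toy quality score: how many node pairs are joined by some path."""
    return sum(len(c) * (len(c) - 1) // 2 for c in components(nodes, edges))


def links_worth_simulating(nodes, edges, min_gain=1):
    """Return missing links whose addition raises the score by >= min_gain."""
    base = connected_pairs(nodes, edges)
    existing = {frozenset(e) for e in edges}
    ranked = []
    for a, b in combinations(sorted(nodes), 2):
        if frozenset((a, b)) in existing:
            continue
        gain = connected_pairs(nodes, edges + [(a, b)]) - base
        if gain >= min_gain:
            ranked.append((gain, a, b))
    return sorted(ranked, reverse=True)


nodes = ["fuel_combustion", "piston_geometry", "injector_geometry", "valve_geometry"]
edges = [("fuel_combustion", "piston_geometry"),
         ("piston_geometry", "injector_geometry")]
for gain, a, b in links_worth_simulating(nodes, edges):
    print(f"simulate {a} <-> {b} (quality gain {gain})")
```

In this toy graph only links that attach the isolated valve geometry node are proposed; the link between fuel combustion and injector geometry adds nothing to the score, so no simulation would be spent on it.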


• HPC development is typically focused on increased speed.

• The fastest calculation is the one you don't run!

• Can we use machine learning to make better decisions about which simulations give the most value?

• Can we use machine learning to improve the resolution of information?

'Cognitive' workflow uses 1/3 of the calculations to achieve 4 orders of magnitude resolution increase
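
One way to picture "deciding which simulations to run" is a selection loop that spends a fixed budget on the most informative candidates. The sketch below uses greedy farthest-point selection over a one-dimensional parameter as a deliberately simple stand-in for a learned model; the function names and the distance-based notion of "value" are assumptions, not the workflow behind the figures quoted above.

```python
def pick_next(candidates, simulated):
    """Pick the candidate farthest from every point already simulated."""
    return max(candidates, key=lambda x: min(abs(x - s) for s in simulated))


def plan_runs(candidates, simulated, budget):
    """Greedily choose up to `budget` new simulation points."""
    simulated = list(simulated)
    remaining = [c for c in candidates if c not in simulated]
    chosen = []
    for _ in range(min(budget, len(remaining))):
        nxt = pick_next(remaining, simulated)
        chosen.append(nxt)
        simulated.append(nxt)   # treat the point as covered from now on
        remaining.remove(nxt)
    return chosen


# Ten candidate parameter values, one already simulated, budget of three runs.
print(plan_runs(list(range(10)), [0], 3))
```

A machine-learning version would replace the distance heuristic with a surrogate model's predicted uncertainty or expected gain, but the control flow (score candidates, run the best, update, repeat) stays the same.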

3/22/2018, IBM Confidential

IBM Cloud private for Cognitive and HPC

• A private cloud platform for enterprises to develop and run their workloads locally

• An integrated platform consisting of Kubernetes and the developer services necessary to create, run, and manage cloud applications

• Support for Deep Learning, batch processing, GPUs, microservices, CI/CD, …

• Platform to deliver modernized IBM middleware and data services to enterprise customers

https://ibm.biz/cloud-private
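
To make the batch-processing and GPU support concrete, here is a hedged sketch that builds a standard Kubernetes `batch/v1` Job manifest requesting GPUs. The job name, image and command are hypothetical, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed on the cluster; the slides do not say how GPUs are exposed to the scheduler.

```python
import json


def gpu_batch_job(name, image, command, gpus=1):
    """Build a Kubernetes batch/v1 Job manifest as a plain dictionary."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": list(command),
                        # Assumes the NVIDIA device plugin provides this resource.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


# Hypothetical training job; the image and script names are placeholders.
manifest = gpu_batch_job(
    "train-classifier",
    "example.local/dl-train:latest",
    ["python", "train.py"],
    gpus=4,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.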

[Architecture diagram]

Remote Users; application; service

Nodes: K8 Master, K8 Worker 0, K8 Worker 1, K8 Worker 2, …, K8 Worker N

Workloads: Physics; Cognitive/ML; Analytics; Viz; SPARK; Graph Analytics

IBM Data Broker

Each node: CPUs, IB, NVMe, HDD, Accel; OS/DD; kubelet; docker

Kubernetes components shown: kubectl, kube-ctrlr-manager, batchjob-ctlr, kube-scheduler, kube-apiserver, etcd

Legend: New component being proposed by Spectrum team

Data Centric Systems Test Cluster
• 26 POWER8+ Minsky servers
• 2x P8+ CPUs/node (20 cores, 160 threads)
• 4x NVIDIA P100 GPUs/node
• 512 GB DRAM/node
• 3.2 TB NVMe drives/node
• Spectrum Scale storage
• EDR InfiniBand
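
The per-node figures above can be multiplied out to cluster totals. The short sketch below does that arithmetic, assuming that "20 cores, 160 threads" is a per-node figure (two CPUs together) and that all 26 servers are configured identically.

```python
NODES = 26

# Per-node figures from the test cluster list; reading "20 cores, 160 threads"
# as a per-node total is an assumption.
PER_NODE = {
    "cpu_cores": 20,
    "hw_threads": 160,
    "gpus": 4,
    "dram_gb": 512,
    "nvme_tb": 3.2,
}


def cluster_totals(nodes, per_node):
    """Scale each per-node quantity by the number of nodes."""
    return {key: nodes * value for key, value in per_node.items()}


totals = cluster_totals(NODES, PER_NODE)
for key, value in totals.items():
    print(f"{key}: {value:g}")
```

Under those assumptions the cluster offers 520 cores, 4,160 hardware threads, 104 GPUs, 13,312 GB of DRAM and about 83.2 TB of NVMe capacity.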

IBM Cloud private base + HPC/Cognitive components

[Stack diagram, labels as shown on the slide]

POWER, x86 clusters + GPU; NVMe, 10GbE, IB, ESS/GPFS

Infrastructure management: GT/SoftLayer

KVM; Bare-metal node

Object store; Docker Registry

Container service: Kubernetes; Kube-proxy

DL Training (DLaaS); Spark, IB optimized; Technical Computing (TCaaS)

DL Insight: image classifier model training; image classification

Deep learning workloads: image classification

Spark workload: GATK4

Tech Computing: R, OpenFOAM, Octave, Sci-Viz, MOAB; CompChemistry; Engineering

CI/CD environment

IBM Data Broker

Hybrid Cloud – System of Systems

Three systems are shown (On-Premise, On-Premise, Commercial Cloud), each with:
• System Orchestrator
• Application Server
• Framework Components

Cloud Broker == IBM Cloud private

System of Systems Orchestrator

Data Producers; Data Consumers

Flow labels: Data and SW; Data, Results and SW; Data and SW

System Components

IBM Cloud private: conduit to a variety of systems

Data Producers; Data Consumers

Application Server

Framework Components: Collection; Curation; Analytics; Visualization; Access; Knowledge Library

Resource Management

Scheduling
• Batch
• Interactive
• Fault tolerant

IO
• Object
• File
• Streaming

Infrastructure
• VM
• Container
• Baremetal

Management; Security & Privacy; Messaging and Communication

On-prem, customer managed

(Bluemix Local)

IBM Cloud private

X86, Power & Z

X86 based systems

On-prem, IBM managed

Off-prem, IBM managed (Bluemix Public or Dedicated)

Linux

kube-arbitrator

GPFS/Parallel object store

Spectrum MPI

Spectrum LSF; Conductor w/Spark; Symphony

XLc/C/Fortran

Compute Accelerators (GPUs, AI, FPGA, etc.) // High Performance Network (RoCE, IB, RRC) // NVMe, Flash

Math libraries: ESSL, GPU, AI

AI frameworks (PowerAI, DLaaS)

Workflow Managers (TCaaS)

HPC, AI Applications

xCAT Provisioning

Ubiquity Storage drivers

Knowledge Space

Simulation

Weeks

Evidence/Experiments

• Supercomputing

• Quantum and new computing paradigms

• Inference (ML)

Ingest data and create massive knowledge spaces

Link evidence with knowledge spaces. Drive deep search