
IBM HPC/HPDA/AI Solutions

Albert Valls Badia, IBM Client Technical Architect, IBM Systems Hardware

Albert_Valls@es.ibm.com

June 15th, 2017


New Drivers and Directions – Datacentric

• Data volumes are exploding, especially unstructured data

• Data needs to be collected, managed, and digested

• Deriving insight and information from the data requires:

  • A variety of processing steps in a workflow

  • A variety of processing optimizations

• Many analytics steps can make use of large in-memory solvers

• Energy efficiency requires:

  • Processing elements that are optimized to the task

  • Energy- and data-aware workflow management

• The OpenPOWER Foundation provides innovation opportunities to a variety of partners

• Making innovations like accelerators consumable is critical

[Figure: price/performance trajectory from 2000 to 2020; full system stack innovation required across Technology and Processors, Firmware / OS, Accelerators, Software, Storage, Network, Workflow, and the Dependency Graph]

OpenPOWER and Innovation (strategy started in 2014)

[Figure: the IBM stack combined with open innovation through OpenPOWER; research and innovation shared with partners such as Google, NVIDIA, TYAN, and Mellanox]

OpenPOWER: Bringing Partner Innovation to Power Systems

From 5 initial members to 200+ members in 24 countries

OpenPOWER Innovation Pervasive in System Design (21 TFlops/node)


• NVIDIA: Tesla P100 GPU with NVLink; NVLink interface

• Ubuntu by Canonical: launch OS supporting NVLink and the Page Migration Engine

• Wistron: platform co-design

• Mellanox: InfiniBand/Ethernet connectivity in and out of the server

• Samsung: 2.5" SSDs

• HGST: optional NVMe adapters

• Hynix, Samsung, Micron: DDR4 memory

• IBM: POWER8 CPU

POWER8: Leadership performance - designed for Memory Intensive Workloads


• 12 cores per socket, 96 threads (8 threads per core), 4 cache levels

• Memory buffer chips in front of the DRAM: up to 1/2 TB per socket at up to 230 GB/s sustained, with consistent speed

• Faster cores, bigger caches, accelerator direct links

• 3x higher memory bandwidth, 1 TB per socket

Differentiated Acceleration - CAPI and NVLink

New ecosystems with CAPI (CAPI-attached accelerators)

• Partners innovate, add value, and gain revenue together with IBM

• Technical and programming ease: virtual addressing, cache coherence

• The accelerator (FPGA or ASIC) is a hardware peer: its PSL talks to the CAPP unit on POWER8 over the coherence bus

NVIDIA Tesla GPU with NVLink (future, innovative systems with NVLink)

• POWER8 with NVLink: 80 GB/s peak* (40+40 GB/s) between system memory and the GPU's graphics memory

• Faster GPU-GPU communication

• Breaks down barriers between CPU and GPU

• Enables new system architectures


IBM Power Accelerated Computing Roadmap

2015: POWER8 ("Firestone" S822LC server); CAPI interface; Mellanox ConnectX-4 EDR InfiniBand over PCIe Gen3; NVIDIA Kepler GPUs over PCIe Gen3

2016: POWER8 with NVLink ("Minsky" S822LC for HPC); NVLink; Mellanox ConnectX-4 EDR InfiniBand, CAPI over PCIe Gen3; NVIDIA Pascal GPUs with NVLink

2017: POWER9 ("Witherspoon"); enhanced CAPI and NVLink; Mellanox HDR InfiniBand, enhanced CAPI over PCIe Gen4; NVIDIA Volta GPUs with enhanced NVLink

2020+: POWER10; system name TBD; interconnect and GPU TBD


FLOPS are not the only KPI in HPC: example workflow in seismic analysis.

• Read from storage

• Memory load

• Preprocessing

• Real-time algorithm execution

• Visualization and insight

• Simulation and modeling

Every step in the workflow takes advantage of different hardware capabilities, hence the need for a balanced system design.
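As a rough illustration of that point (the stage names and resource labels are hypothetical, not taken from the deck), a short Python sketch of a staged workflow in which each step is bound by a different part of the system:

```python
# Each pipeline stage stresses a different resource, so a balanced system design
# matters more than peak FLOPS alone.
SEISMIC_PIPELINE = [
    ("read from storage",         "storage bandwidth"),
    ("memory load",               "memory capacity"),
    ("preprocessing",             "CPU cores"),
    ("real-time algorithm",       "accelerator FLOPS"),
    ("visualization and insight", "network / graphics"),
    ("simulation and modeling",   "FLOPS + memory bandwidth"),
]

def describe(pipeline):
    """Print which resource each workflow stage is typically bound by."""
    for stage, bottleneck in pipeline:
        print(f"{stage:26s} -> typically bound by {bottleneck}")

if __name__ == "__main__":
    describe(SEISMIC_PIPELINE)
```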

IBM Data Centric Computing Strategy: HPC->HPDA

Introducing IBM Spectrum Scale

• Remove data-related bottlenecks with a parallel, scale-out solution

• Enable global collaboration with unified storage and global namespace

• Optimize cost and performance with automated data placement

• Ensure data availability, integrity and security with erasure coding, replication, snapshots, and encryption

Highly scalable high-performance unified storage

for files and objects with integrated analytics

Unified Scale-out Data Lake

• File In/Out, Object In/Out; Analytics on demand.

• High-performance native protocols

• Single Management Plane

• Cluster replication & global namespace

• Enterprise storage features across file, object & HDFS

[Figure: Spectrum Scale presents NFS, SMB, POSIX, Swift/S3, and HDFS interfaces over tiered storage (SSD, fast disk, slow disk, tape), with Spectrum Scale Native RAID (SSNR), compression, and encryption]

IBM Spectrum Scale: Parallel Architecture


No Hot Spots

• All NSD servers export to all clients in active-active mode

• Spectrum Scale stripes files across NSD servers and NSDs in units of the file-system block size (see the striping sketch after the figure note below)

• File-system load spread evenly

• Easy to scale file-system capacity and performance while keeping the architecture balanced

NSD Client does real-time parallel I/O

to all the NSD servers and storage volumes/NSDs

[Figure: NSD clients perform parallel I/O over an Ethernet (TCP/IP) or low-latency InfiniBand network to the NSD servers, which sit in front of heterogeneous block storage: block-storage arrays and JBODs managed either by commodity Spectrum Scale file servers (x86_64 or Power) or by Spectrum Scale Native RAID controllers in the IBM Elastic Storage solution]
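To illustrate the striping idea (a simplified model, not the actual GPFS/Spectrum Scale NSD code; the block size and server names are assumptions), a short Python sketch that assigns consecutive file-system blocks to NSD servers round-robin, so every server sees an even share of any large file:

```python
BLOCK_SIZE = 4 * 1024 * 1024                          # example file-system block size: 4 MiB
NSD_SERVERS = ["nsd01", "nsd02", "nsd03", "nsd04"]    # hypothetical NSD server names

def stripe_plan(file_size: int, servers=NSD_SERVERS, block_size=BLOCK_SIZE):
    """Map each block of a file to the NSD server that stores it (round-robin)."""
    n_blocks = (file_size + block_size - 1) // block_size
    return {block: servers[block % len(servers)] for block in range(n_blocks)}

def blocks_per_server(file_size: int):
    """Show that the load is spread evenly, with no single-server hot spot."""
    counts = {}
    for server in stripe_plan(file_size).values():
        counts[server] = counts.get(server, 0) + 1
    return counts

if __name__ == "__main__":
    one_gib = 1 << 30
    print(blocks_per_server(one_gib))   # each of the 4 servers holds 64 of the 256 blocks
```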

Spectrum Scale Cluster Overview

• Application nodes (Oracle, ERP, HPC cluster) run Spectrum Scale clients and mount the file system (/file_systemA) directly over the Spectrum Scale NSD protocol

• Spectrum Scale protocol nodes re-export the same file system as NFS exports, SMB shares, and OpenStack Swift (HTTP GET/PUT) to NFS, SMB, and Swift clients, with clustered failover across up to 16 (SMB) or 32 (NFS) servers

• NSD servers use disk volumes/LUNs; the file-system load is spread evenly across all servers, with no hot spots and no single-server bottleneck

• Data is striped across the servers in block-size units

• Access to the same data can be shared over NFS, SMB, and Swift/S3

• Easy to scale while keeping the architecture balanced: both capacity and performance can be added

Spectrum Scale Architecture Highlights: Scalability

Data scalability

Capacity: Large number of disks/LUNs in a single file system

Throughput: wide striping, large block size

Capacity efficient (data in i-node, fragments)

Multiple nodes write in parallel (even within single file)

Metadata scalability

Wide striping of all metadata (inodes, indirect blocks, directories, allocation maps...)

Scalable data structures: Segmented allocation map,

Extensible hashing for directories

Highly scalable, distributed lock manager:

After obtaining a lock token, each node can cache metadata, update it locally, and write back directly

Fine-grain locking when necessary: shared inode write locks, byte-range locks, and locking of directory entries by name (hash); see the byte-range sketch below

A dynamically elected metanode collects inode, indirect-block, and directory updates
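As an illustration of byte-range locking in general (not Spectrum Scale's actual token protocol or data structures), a minimal Python sketch of conflict detection between byte-range lock requests:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ByteRangeLock:
    node: str        # node holding or requesting the lock
    offset: int      # first byte covered
    length: int      # number of bytes covered
    exclusive: bool  # True = write lock, False = shared read lock

    @property
    def end(self) -> int:
        return self.offset + self.length

def conflicts(a: ByteRangeLock, b: ByteRangeLock) -> bool:
    """Two locks conflict if their ranges overlap and at least one is exclusive."""
    overlap = a.offset < b.end and b.offset < a.end
    return overlap and (a.exclusive or b.exclusive)

# Two nodes writing to disjoint regions of the same file can proceed in parallel;
# only overlapping ranges involving a writer have to wait for a token.
held = ByteRangeLock("node1", offset=0,       length=4 << 20, exclusive=True)
req1 = ByteRangeLock("node2", offset=4 << 20, length=4 << 20, exclusive=True)
req2 = ByteRangeLock("node3", offset=1 << 20, length=1 << 20, exclusive=False)

print(conflicts(held, req1))  # False: disjoint ranges
print(conflicts(held, req2))  # True: overlaps an exclusive write lock
```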

Speed and simplicity: Graphical user interface

• Reduce administration overhead: a graphical user interface for common tasks, performance monitoring, and problem determination

• Easy to adopt: common IBM Storage UI framework

• Integrated into Spectrum Control: storage portfolio visibility, consolidated management, multiple clusters

Spectrum Scale Built-in Tiering (ILM)

Challenge

• Data growth is outpacing budget

• Low-cost archive is another storage silo

• Flash is under-utilized because it isn't shared

• Locally attached disk can't be used with centralized storage

• Migration overhead is preventing storage upgrades

Solution

• Automated data placement spanning the entire storage portfolio, including DAS, with a single namespace

• Policy-driven data placement and data migration

• Share storage, even low-latency flash

• Automatic failover and seamless file-system recovery

• Lower TCO

• Powerful policy engine for Information Lifecycle Management, with fast metadata scanning and data movement

• Automated data migration based on thresholds; users are not affected by data migration

• Example: when online storage reaches 90% full, move all files of 1 GB or larger that are 60 days old to offline storage to free up space (see the sketch after the figure note below)

[Figure: policy-driven automation across a System pool (flash), a Gold pool (SSD), and a Silver pool (NL-SAS); for example, keep files accessed today and smaller than 1 GB on flash, migrate small files last accessed more than 30 days ago, migrate files last accessed more than 60 days ago, and drain the Silver pool to 20% when it is more than 60% full]
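A minimal Python sketch of the threshold-driven selection logic in the example above (90% full, files of 1 GiB or larger that have not been touched for 60 days). This only illustrates the idea; it is not the Spectrum Scale policy engine or its rule language, and the mount point is the hypothetical /file_systemA from the overview slide.

```python
import os
import time

GIB = 1 << 30
SIXTY_DAYS = 60 * 24 * 3600

def pool_usage_fraction(path: str) -> float:
    """Fraction of the file system in use (stand-in for pool occupancy)."""
    st = os.statvfs(path)
    return 1.0 - st.f_bavail / st.f_blocks

def migration_candidates(root: str, min_size=GIB, min_age=SIXTY_DAYS):
    """Yield files that are at least 1 GiB and have not been accessed for 60 days."""
    now = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue
            if st.st_size >= min_size and now - st.st_atime >= min_age:
                yield full, st.st_size

if __name__ == "__main__":
    root = "/file_systemA"   # hypothetical mount point
    if pool_usage_fraction(root) >= 0.90:
        for path, size in migration_candidates(root):
            # A real policy run would hand these files to the data-movement engine;
            # here we only print what would be migrated to the offline tier.
            print(f"migrate {path} ({size / GIB:.1f} GiB)")
```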

Spectrum Scale HDFS Transparency

Challenge

• Separate storage systems for ingest, analysis, and results

• HDFS requires locality-aware storage (namenode)

• Data transfer slows time to results

• Different frameworks and analytics tools use data differently

Solution: HDFS Transparency

• Map/Reduce on shared or shared-nothing storage

• No waiting for data transfer between storage systems

• Immediately share results

• Single data lake for all applications

• Enterprise data management

• Archive and analysis in place

[Figure: the traditional analytics solution ingests data from the existing system into a separate analytics system and exports results back; with the in-place analytics solution, the analytics system reaches the Spectrum Scale file system (file and object) directly through HDFS Transparency]

Spectrum Scale Compression

• Transparent compression for the HDFS Transparency, object, NFS, SMB, and POSIX interfaces

• Improved storage efficiency

• Typically 2x improvement in storage efficiency

• Improved I/O bandwidth

• Read/write compressed data reduces load on storage

• Improved client side caching

• Caching compressed data increases apparent cache size

• Per file compression

• Use policies

• Compress cold data

– Data not being used/accessed


Spectrum Scale Encryption

• Native Encryption of data at rest

• Files are encrypted before they are stored on disk

• Keys are never written to disk

• No data leakage in case disks are stolen or improperly decommissioned

• Secure deletion

• Ability to destroy arbitrarily large subsets of a file system

• No ‘digital shredding’, no overwriting: secure deletion is a cryptographic operation (see the sketch after this list)

• Use a Spectrum Scale policy to encrypt (or exclude) files in a fileset or file system

• Generally < 5% performance impact
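A minimal Python sketch of why cryptographic deletion works, using the third-party cryptography package purely as an illustration (Spectrum Scale manages its keys through an external key server, not this library): once the file's encryption key is destroyed, the ciphertext left on disk is unrecoverable.

```python
# pip install cryptography   (illustrative only, not the Spectrum Scale key path)
from cryptography.fernet import Fernet

# Encrypt file data with a per-file key that lives only in the key store,
# never alongside the data on disk.
file_key = Fernet.generate_key()
cipher = Fernet(file_key)

plaintext = b"seismic trace block"
ciphertext = cipher.encrypt(plaintext)          # this is what would land on disk

# Normal read path: fetch the key, decrypt.
assert Fernet(file_key).decrypt(ciphertext) == plaintext

# "Secure deletion": destroy the key instead of overwriting the data.
file_key = None
# Without the key the ciphertext is computationally unrecoverable, which is why
# arbitrarily large subsets of a file system can be "shredded" by deleting keys.
```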

Performance feature: Spectrum Scale Local Read-Only Cache (LROC)

Benefits

• Expands the local node file cache (pagepool)

• Leverages fast local storage on the application nodes

• Can reduce load on central storage

• Transparent to applications

• Can use inexpensive local devices

Where to use it

• Protocol nodes

• Virtual machine storage

• Large-memory analytics

Easy to enable

• Create an NSD of type localCache

• Define only this node as its NSD server

Performance feature: Spectrum Scale Highly Available Write Cache (HAWC)

Benefits

• Speeds up small writes

• Used by IBM Elastic Storage Server

Where to use it

• Logs handle the small writes

• Any storage architecture: shared disk, or shared nothing (use replication)

• I/O sizes up to 64 KiB

Easy to enable

• Create a system.log pool on flash, local storage, or shared storage

• Enable the write cache on the file system

Spectrum Scale Multicluster: cross-cluster sharing


• Cross-mounting file systems between Spectrum Scale clusters

• Separate clusters = separate administration domains

• When connection is established, all nodes are interconnected

– All nodes in both clusters must be within same IP network segment / VLAN

– Channel can be encrypted (openssl)

Synchronous Replication & Stretched Cluster

• Performed synchronously by the node who writes to disk

• Synchronous replication happens within Spectrum Scale cluster

• I/O does not return to the application until both copies are written

• Active/Active data access

• Read from fastest source

• DR with automatic failover and seamless file-system recovery

• If replication spans two sites, the result is a Spectrum Scale stretched cluster

[Figure: the application writes synchronous replicas to both sites and reads from whichever copy is fastest]

Spectrum Scale Active File Management (AFM) • An asynchronous, cross-cluster, data-sharing utility

• Functions well over unreliable and high latency networks

• Extends global name space between multiple WAN dispersed locations to share and exchange data asynchronously

• Caches local copies of data distributed to one or more clusters to improve local read and write performance

• As data is written or modified at one location, all other locations see that same data


Spectrum Scale AFM Main Concepts

• Home - Where the information lives. Owner of the data in a cache relationship

• Cache - Fileset in a remote cluster that points to home

• The relationship between a Cache and Home is one to one

• Cache knows about its Home. Home does not know a cache exists

• Data is copied to the cache when requested or data written at the cache is copied back to home as fast as possible


IBM Elastic Storage Server (ESS) is a Software-Defined Solution

Migrate RAID and disk management from custom dedicated disk controllers to commodity file servers.

[Figure: before, clients reach Spectrum Scale Server 1 and 2 over FDR InfiniBand or 10/40 GbE, with custom dedicated disk controllers in front of JBOD disk enclosures; after, Spectrum Scale RAID runs on the commodity file servers themselves, which take over RAID and disk management for the JBOD enclosures]

Spectrum Scale Native RAID is a software implementation of storage RAID technologies within Spectrum Scale. It requires special licensing and is only approved for pre-certified architectures such as Lenovo GSS and IBM ESS (Elastic Storage Server).

Advantages of Spectrum Scale RAID

• Use of standard and inexpensive disk drives; the erasure code is implemented in Spectrum Scale software

• Data is declustered and distributed across all disk drives with the selected RAID protection: 3-way, 4-way, RAID6 8+2P, or RAID6 8+3P

• Faster rebuild times: because data is declustered, more disks are involved during a rebuild, approximately 3.5 times faster than RAID-5

• Minimal impact of rebuilds on system performance: the rebuild is done by many disks, and rebuilds can be deferred while there is sufficient protection

• Better fault tolerance: end-to-end checksums and a much higher mean time to data loss (MTTDL)

RAID algorithm

• Two types of RAID: 3- or 4-way replication, and 8 + 2 or 8 + 3 parity (Reed-Solomon)

• 2-fault and 3-fault tolerant codes (RAID-D2, RAID-D3); compared in the sketch below

[Figure: 2-fault tolerant codes are 3-way replication (1+2), one strip (a GPFS block) plus 2 replicated strips, and 8+2p Reed-Solomon, 8 strips (a GPFS block) plus 2 redundancy strips; 3-fault tolerant codes are 4-way replication (1+3) and 8+3p Reed-Solomon, which add a third replica or redundancy strip]
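A small Python sketch (illustrative arithmetic, not GPFS code) comparing the usable-capacity fraction and fault tolerance of the protection schemes above:

```python
from fractions import Fraction

def replication(copies: int):
    """N-way replication: one data strip plus (copies - 1) replicas."""
    return {"usable": Fraction(1, copies), "failures_tolerated": copies - 1}

def reed_solomon(data_strips: int, parity_strips: int):
    """data + parity erasure code: any `parity_strips` strips can be lost."""
    total = data_strips + parity_strips
    return {"usable": Fraction(data_strips, total),
            "failures_tolerated": parity_strips}

schemes = {
    "3-way replication (1+2)": replication(3),
    "4-way replication (1+3)": replication(4),
    "8+2p Reed-Solomon":       reed_solomon(8, 2),
    "8+3p Reed-Solomon":       reed_solomon(8, 3),
}

for name, s in schemes.items():
    print(f"{name:26s} usable capacity {float(s['usable']):.0%}, "
          f"tolerates {s['failures_tolerated']} failure(s)")
# 8+2p keeps 80% of raw capacity and survives any 2 strip failures,
# versus 33% usable for 3-way replication with the same 2-failure tolerance.
```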

Rebuild overhead reduction example: declustered RAID6 (critical rebuild performance on a GL6 with 8+2p)

During the critical rebuild the impact on the workload was high, but as soon as the array was back to single-parity protection the impact on the customer's workload was below 2%. The critical rebuild itself took 6 minutes. The Data Integrity Manager prioritizes tasks: rebuild, rebalance, data scrubbing, and proactive correction.

End-to-end checksum

• True end-to-end checksum from the disk surface to the client's Spectrum Scale interface

• Repairs soft/latent read errors and lost/missing writes

• Checksums are maintained on disk and in memory and are transmitted to/from the client

• The checksum is stored in a 64-byte trailer of each 32-KiB buffer: an 8-byte checksum plus 56 bytes of ID and version information

• A sequence number is used to detect lost/missing writes

[Figure: 8 data strips plus 3 parity strips; each 32-KiB buffer carries a 64-byte trailer, and the strip ends with a ¼- to 2-KiB terminus]
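A minimal Python sketch of the buffer-plus-trailer layout described above (32-KiB buffer, 64-byte trailer holding an 8-byte checksum and 56 bytes of ID/version data). The checksum algorithm and exact field layout here are illustrative assumptions, not the actual GPFS Native RAID on-disk format.

```python
import struct
import zlib

BUFFER_SIZE = 32 * 1024      # 32-KiB data buffer
TRAILER_SIZE = 64            # 64-byte trailer
CHECKSUM_SIZE = 8            # 8-byte checksum
ID_VERSION_SIZE = TRAILER_SIZE - CHECKSUM_SIZE   # 56 bytes of ID and version info

def make_trailer(data: bytes, id_version: bytes) -> bytes:
    """Build a 64-byte trailer: 8-byte checksum over data + metadata, then ID/version."""
    assert len(data) == BUFFER_SIZE and len(id_version) == ID_VERSION_SIZE
    checksum = zlib.crc32(data + id_version)     # illustrative checksum choice
    return struct.pack("<Q", checksum) + id_version

def verify(buffer_with_trailer: bytes) -> bool:
    """Recompute the checksum on read to catch corruption or a lost/missing write."""
    data = buffer_with_trailer[:BUFFER_SIZE]
    trailer = buffer_with_trailer[BUFFER_SIZE:]
    stored, = struct.unpack("<Q", trailer[:CHECKSUM_SIZE])
    id_version = trailer[CHECKSUM_SIZE:]
    return stored == zlib.crc32(data + id_version)

# Example: the ID/version area embeds a sequence number so stale (lost-write)
# versions can be detected as well as bit flips.
data = bytes(BUFFER_SIZE)
id_version = struct.pack("<Q", 42).ljust(ID_VERSION_SIZE, b"\0")  # sequence number 42
block = data + make_trailer(data, id_version)
print(verify(block))                 # True
print(verify(b"\1" + block[1:]))     # False: corruption detected
```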

IBM Elastic Storage Server family

GS models use 2.5-inch JBODs or SSDs; supported drives: 1.2 TB and 1.8 TB SAS HDDs, and 400 GB, 800 GB, and 1.6 TB SSDs.

GL models use 3.5-inch JBOD enclosures; supported drives: 4 TB, 6 TB, and 8 TB NL-SAS HDDs.

Supported NICs: 10 GbE and 40 GbE Ethernet, and FDR or EDR InfiniBand.

[Figure: racks of FC 5887 disk enclosures, 24 drive slots each]

Model GS1: 24 SSDs; 6 GB/s
Net capacity: 400 GB = 6 TB, 800 GB = 13 TB, 1.6 TB = 26 TB

Model GS2: 46 SAS + 2 SSD, or 48 SSDs; 2 GB/s (SAS), 12 GB/s (SSD)
Net capacity: 400 GB = 13 TB, 800 GB = 26 TB, 1.6 TB = 53 TB (SSD); 1.2 TB = 35 TB, 1.6 TB = 53 TB (SAS)

Model GS4: 94 SAS + 2 SSD, or 96 SSDs; 5 GB/s (SAS), 16 GB/s (SSD)
Net capacity: 400 GB = 28 TB, 800 GB = 57 TB, 1.6 TB = 115 TB (SSD); 1.2 TB = 78 TB, 1.6 TB = 117 TB (SAS)

Model GS6: 142 SAS + 2 SSD; 7 GB/s
Net capacity: 1.2 TB = 121 TB, 1.6 TB = 182 TB

Model GL2 (Analytics Focused): 2 enclosures, 12U, 116 NL-SAS + 2 SSD; 5-8 GB/s
Net capacity: 4 TB = 327 TB, 6 TB = 491 TB, 8 TB = 655 TB

Model GL4 (Analytics and Cloud): 4 enclosures, 20U, 232 NL-SAS + 2 SSD; 10-16 GB/s
Net capacity: 4 TB = 673 TB, 6 TB = 1 PB, 8 TB = 1.3 PB

Model GL6 (PetaScale Storage): 6 enclosures, 28U, 348 NL-SAS + 2 SSD; 10-25 GB/s
Net capacity: 4 TB = 1 PB, 6 TB = 1.5 PB, 8 TB = 2 PB

ESS New Models: Performance and Capacity (sequential throughput vs. capacity)

Existing models:
Model GL2: 2 enclosures, 12U, 116 NL-SAS + 2 SSD; max 0.9 PB raw; 8 GB/s
Model GL4: 4 enclosures, 20U, 232 NL-SAS + 2 SSD; max 1.8 PB raw; 17 GB/s
Model GL6: 6 enclosures, 28U, 348 NL-SAS + 2 SSD; max 2.8 PB raw; 25 GB/s

New models (5U84 storage enclosures):
New! Model GL2S: 2 enclosures, 14U, 166 NL-SAS + 2 SSD; max 1.6 PB raw; 11 GB/s
Net capacity: 4 TB = 508 TB, 8 TB = 1 PB, 10 TB = 1.27 PB
New! Model GL4S: 4 enclosures, 24U, 334 NL-SAS + 2 SSD; max 3.3 PB raw; 23 GB/s
Net capacity: 4 TB = 1 PB, 8 TB = 2 PB, 10 TB = 2.5 PB
New! Model GL6S: 6 enclosures, 34U, 502 NL-SAS + 2 SSD; max 5 PB raw; 34 GB/s
Net capacity: 4 TB = 1.5 PB, 8 TB = 3.1 PB, 10 TB = 3.9 PB

Software-Defined Compute: IBM Platform Computing
Delivering a highly utilized shared-services environment optimized for time to results

Traditional: IT-constrained, repeated for many applications and groups
• Application examples: simulation, analysis, design, big data
• Long wait times, low utilization, IT sprawl

Software-defined with IBM Platform Computing, on clusters, grid, or cloud
• Big data / Hadoop, simulation and modeling, analytics, and long-running services share one resource pool: make lots of computers look like “one”
• Prioritized matching of supply with demand
• Benefits: high utilization, throughput, performance, prioritization, reduced cost; faster results with fewer resources

Overall Artificial Intelligence (AI) Space


• Artificial Intelligence: human intelligence exhibited by machines

• Machine Learning and Deep Learning (artificial neural networks): cognitive ML/DL systems are “human trained” using large amounts of data and the ability to learn how to perform the task

• New data sources: NoSQL, Hadoop and analytics

• A new class of applications: machine learning and training, pattern matching, image workloads, real-time decision support, complex workflows, data lakes

• Extending enterprise applications: fraud detection and prevention in finance, shopping advisors in retail, diagnostics and treatment in healthcare, supply chain and logistics; extending predictive analytics to advanced analytics with AI

• Growing across compute, middleware, and storage

PowerAI Platform

• Deep learning frameworks: Caffe, NVCaffe, IBMCaffe, Torch, Theano, TensorFlow, DL4J

• Supporting libraries: OpenBLAS, Bazel, DIGITS, NCCL

• Distributed frameworks: coming soon

• Accelerated servers and infrastructure for scaling: a cluster of NVLink servers, Spectrum Scale as the high-speed parallel file system, and the option to scale to cloud
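Before launching any of these frameworks on a GPU-accelerated node, it is common to check that the GPUs are visible to the driver. A minimal Python sketch using the standard nvidia-smi tool; the query fields shown are generic nvidia-smi options, not anything specific to PowerAI, and the check is only a sanity test, not part of the stack above.

```python
import subprocess

def list_gpus():
    """Return (name, total_memory) pairs reported by nvidia-smi, or [] if none found."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [tuple(part.strip() for part in line.split(","))
            for line in out.strip().splitlines() if line.strip()]

if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        print("No NVIDIA GPUs visible - check drivers before launching a framework.")
    for name, memory in gpus:
        print(f"Found {name} with {memory}")
```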

Where to start? IBM Power System S822LC, the deep learning server

• 20 POWER8 cores with NVLink
• Up to 1 TB DDR4 memory
• Up to 4 Tesla P100 GPUs with NVLink

Use cases:
• Parallel computing, e.g. Universidad Carlos III and Barcelona Supercomputing Center
• GPU development and optimisation, e.g. molecular dynamics at Centro de Biología Molecular
• Machine learning / deep learning

Entry configuration: 20-core POWER8 + 256 GB + 1 NVIDIA Volta GPU, starting at €27,500 + VAT

Questions?
