IBM HPC/HPDA/AI Solutions
Albert Valls Badia, IBM Client Technical Architect
IBM Systems Hardware
Albert_Valls@es.ibm.com
June 15th, 2017
New Drivers and Directions – Datacentric
• Data Volumes are Exploding – Especially Unstructured Data
• Data Needs to be Collected, Managed, and Digested
• Deriving Insight and Information from the Data requires:
• A variety of processing steps in a Workflow
• A variety of processing optimizations
• Many Analytics Steps can make use of Large In Memory
Solvers
• Energy Efficiency requires:
• Processing Elements that are Optimized to the task
• Energy and Data aware Workflow Management
• The Open Power Foundation provides innovation
opportunities to a variety of Partners
• Making innovations like Accelerators Consumable is critical
[Figure: Price/Performance from 2000 to 2020 - full system stack innovation required across technology and processors, firmware/OS, accelerators, software, storage, network, and the workflow dependency graph.]
OpenPOWER and Innovation (strategy started in 2014)
[Figure: From a closed, IBM-only stack to open innovation, with research and innovation contributed by OpenPOWER partners such as IBM, NVIDIA, TYAN and Mellanox.]
OpenPOWER: Bringing Partner Innovation to Power Systems
• Grown from 5 initial members to 200+ members in 24 countries
OpenPOWER Innovation Pervasive in System Design (21 TFlops/node)
• IBM: POWER8 CPU with NVLink interface
• NVIDIA: Tesla P100 GPU with NVLink
• Ubuntu by Canonical: launch OS supporting NVLink and the Page Migration Engine
• Wistron: platform co-design
• Mellanox: InfiniBand/Ethernet connectivity in and out of the server
• Samsung: 2.5" SSDs
• HGST: optional NVMe adapters
• Hynix, Samsung, Micron: DDR4 memory
POWER8: Leadership performance - designed for Memory Intensive Workloads
[Figure: POWER8 processor attached to DRAM chips through memory buffers.]
• 12 cores, 96 threads (8 threads per core), 4 cache levels
• Faster cores, bigger caches, consistent speed
• Up to 1/2 TB per socket, up to 230 GB/s sustained memory bandwidth
• 3x higher memory bandwidth, up to 1 TB per socket
• Direct links to accelerators
Differentiated Acceleration - CAPI and NVLink
New ecosystems with CAPI:
• Partners innovate, add value, and gain revenue together with IBM
• Technical and programming ease: virtual addressing, cache coherence
• The accelerator (FPGA or ASIC) is a hardware peer, attached through the CAPP/PSL logic over the coherence bus
NVIDIA Tesla GPU with NVLink:
• POWER8 with NVLink: 80 GB/s peak* (40+40 GB/s) between CPU and GPU
• Faster GPU-GPU communication
• Breaks down barriers between CPU and GPU
• Enables future, innovative system architectures
[Figure: CAPI-attached accelerators (POWER8 with CAPP/PSL over the coherence bus and system memory) alongside NVLink-attached Tesla GPUs with their own graphics memory.]
IBM Power Accelerated Computing Roadmap
2015: IBM POWER8 (CAPI interface over PCIe Gen3); NVIDIA Kepler GPUs (PCIe Gen3); Mellanox ConnectX-4 EDR InfiniBand (PCIe Gen3); system: S822LC "Firestone"
2016: IBM POWER8 with NVLink; NVIDIA Pascal GPUs (NVLink); Mellanox ConnectX-4 EDR InfiniBand (CAPI over PCIe Gen3); system: S822LC for HPC "Minsky"
2017: IBM POWER9 (enhanced CAPI & NVLink); NVIDIA Volta GPUs (enhanced NVLink); Mellanox HDR InfiniBand (enhanced CAPI over PCIe Gen4); system: "Witherspoon"
2020+: IBM POWER10; GPUs, interconnect and system name TBD
FLOPS are not the only KPI in HPC: example workflow in seismic analysis
• Read from storage
• Memory load
• Preprocessing
• Real-time algorithm execution
• Visualization and insight
• Simulation and modeling
Every step in the workflow exercises different hardware capabilities; hence the need for a balanced system design.
IBM Data Centric Computing Strategy: HPC->HPDA
Introducing IBM Spectrum Scale
• Remove data-related bottlenecks with a parallel, scale-out solution
• Enable global collaboration with unified storage and global namespace
• Optimize cost and performance with automated data placement
• Ensure data availability, integrity and security with erasure coding, replication, snapshots, and encryption
Highly scalable high-performance unified storage
for files and objects with integrated analytics
Unified Scale-out Data Lake
• File In/Out, Object In/Out; Analytics on demand.
• High-performance native protocols
• Single Management Plane
• Cluster replication & global namespace
• Enterprise storage features across file, object & HDFS
[Figure: Spectrum Scale serving NFS, SMB, POSIX, Swift/S3 and HDFS from a single pool of SSD, fast disk, slow disk and tape, with native RAID, compression and encryption.]
IBM Spectrum Scale: Parallel Architecture
No Hot Spots
• All NSD servers export to all clients in active-active mode
• Spectrum Scale stripes files across NSD servers and NSDs in units of file-system block-size
• File-system load spread evenly
• Easy to scale file-system capacity and performance while keeping the architecture balanced
The NSD client does real-time parallel I/O to all the NSD servers and storage volumes/NSDs.
[Figure: NSD clients connected to NSD servers and storage over an Ethernet (TCP/IP) or low-latency (InfiniBand) network.]
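To make the striping idea concrete, here is a minimal Python sketch (not GPFS source code) of how consecutive file-system blocks can be assigned round-robin across NSDs so that every server carries an even share of the load; the 4 MiB block size and NSD names are illustrative.

# Conceptual sketch (not the GPFS allocator): wide striping spreads a file's
# blocks round-robin across NSDs so every server sees an even share of I/O.

BLOCK_SIZE = 4 * 1024 * 1024  # assume a 4 MiB file-system block size

def block_placement(file_size: int, nsds: list) -> dict:
    """Map each file-system block index to an NSD, round-robin."""
    placement = {nsd: [] for nsd in nsds}
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    for block in range(num_blocks):
        placement[nsds[block % len(nsds)]].append(block)
    return placement

if __name__ == "__main__":
    # A 1 GiB file over 8 NSDs: every NSD holds ~32 blocks, so no hot spot.
    layout = block_placement(1024 * 1024 * 1024, [f"nsd{i}" for i in range(8)])
    for nsd, blocks in layout.items():
        print(nsd, len(blocks), "blocks")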
Spectrum Scale Cluster Overview
[Figure: One Spectrum Scale cluster exposing a single file system (/file_systemA) through several node roles.]
• Spectrum Scale clients and application nodes (Oracle, ERP, HPC cluster) on commodity servers (x86_64 or Power) mount /file_systemA over the Spectrum Scale NSD protocol.
• Spectrum Scale file servers and Spectrum Scale Native RAID controllers (IBM Elastic Storage solution) serve data from heterogeneous block storage and JBODs; the servers use disk volumes/LUNs.
• Spectrum Scale protocol nodes re-export the same file system to NFS, SMB and OpenStack Swift clients (NFS exports, SMB shares, HTTP GET/PUT) with clustered failover, up to 16 (SMB) or 32 (NFS) servers.
• File-system load is spread evenly across all servers (no hot spots); data is striped across servers in units of the block size, so there is no single-server bottleneck.
• Access to the same data can be shared over NFS, SMB and Swift/S3.
• Easy to scale while keeping the architecture balanced: capacity and performance can both be added.
Spectrum Scale Architecture Highlights: Scalability
Data scalability
• Capacity: large number of disks/LUNs in a single file system
• Throughput: wide striping, large block size
• Capacity efficient (data in i-node, fragments)
• Multiple nodes write in parallel (even within a single file)
Metadata scalability
• Wide striping of all metadata (inodes, indirect blocks, directories, allocation maps, ...)
• Scalable data structures: segmented allocation map, extensible hashing for directories
• Highly scalable, distributed lock manager: after obtaining a lock token, each node can cache metadata, update it locally, and write back directly
• Fine-grain locking, when necessary: shared inode write locks, byte-range locks, locking directory entries by name (hash)
• A dynamically elected metanode collects inode, indirect-block and directory updates
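A small illustration of the access pattern that byte-range locking enables: several processes write disjoint byte ranges of one shared file in parallel. The mount point is a hypothetical Spectrum Scale path; the same sketch runs on any POSIX file system.

# Minimal sketch: several writers update disjoint byte ranges of one shared
# file, the access pattern Spectrum Scale byte-range locks are built for.
import os
from multiprocessing import Process

PATH = "/gpfs/fs1/shared.dat"   # hypothetical Spectrum Scale mount point
CHUNK = 1 * 1024 * 1024         # 1 MiB region per writer

def write_region(rank: int) -> None:
    fd = os.open(PATH, os.O_WRONLY)
    try:
        # Each rank owns its own offset, so byte-range locks never conflict.
        os.pwrite(fd, bytes([rank % 256]) * CHUNK, rank * CHUNK)
    finally:
        os.close(fd)

if __name__ == "__main__":
    with open(PATH, "wb") as f:          # pre-size the file once
        f.truncate(4 * CHUNK)
    procs = [Process(target=write_region, args=(r,)) for r in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()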
Speed and simplicity: graphical user interface
• Reduce administration overhead
• Graphical user interface for common tasks
• Performance monitoring
• Problem determination
• Easy to adopt: common IBM Storage UI framework
• Integrated into Spectrum Control: storage portfolio visibility, consolidated management of multiple clusters
Spectrum Scale Built-in Tiering (ILM)
Challenge
• Data growth is outpacing budget
• Low-cost archive is another storage silo
• Flash is under-utilized because it isn't shared
• Locally attached disk can't be used with centralized storage
• Migration overhead is preventing storage upgrades
Solution
• Automated data placement spanning the entire storage portfolio, including DAS, with a single namespace
• Policy-driven data placement and data migration
• Share storage, even low-latency flash
• Automatic failover and seamless file-system recovery
• Lower TCO
• Powerful policy engine for Information Lifecycle Management
• Fast metadata scanning and data movement
• Automated data migration based on thresholds; users are not affected by data migration
• Example (see the sketch below): when online storage reaches 90% full, move all files of 1 GB or larger that are 60 days old offline to free up space
[Figure: Automated placement across a System pool (flash), Gold pool (SSD) and Silver pool (NL-SAS): files accessed today and smaller than 1 GB stay on fast pools, small files last accessed more than 30 days ago and files last accessed more than 60 days ago move down, and a Silver pool that passes 60% full is drained back to 20%.]
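The example rule above could be sketched in Python as follows. This is purely illustrative and is not the Spectrum Scale ILM policy language (in a real cluster the rule is written as a policy statement and run by the policy engine); the pool capacity and the migrate step are stubs.

# Illustrative sketch of the slide's example threshold rule in Python.
import os, time

GIB = 1024 ** 3
POOL_CAPACITY = 100 * 1024 ** 4          # stub: 100 TiB online pool

def migrate_to_offline(path: str) -> None:
    print("would migrate", path)          # stub for the real data movement

def old_large_files(root: str, min_size=GIB, min_age_days=60):
    """Yield (path, size) for files >= 1 GiB not accessed for 60+ days."""
    cutoff = time.time() - min_age_days * 86400
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            st = os.stat(p)
            if st.st_size >= min_size and st.st_atime < cutoff:
                yield p, st.st_size

def drain_if_full(root: str, used_bytes: int, high=0.90, low=0.80) -> None:
    """Once the pool passes 90% full, migrate candidates until it drops to 80%."""
    if used_bytes / POOL_CAPACITY < high:
        return
    for path, size in sorted(old_large_files(root), key=lambda c: -c[1]):
        migrate_to_offline(path)
        used_bytes -= size
        if used_bytes / POOL_CAPACITY <= low:
            break

if __name__ == "__main__":
    drain_if_full("/gpfs/online_pool", used_bytes=95 * 1024 ** 4)  # hypothetical mount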
Spectrum Scale HDFS Transparency
Challenge
• Separate storage systems for ingest, analysis and results
• HDFS requires locality-aware storage (namenode)
• Data transfer slows time to results
• Different frameworks and analytics tools use data differently
Solution: HDFS Transparency
• Map/Reduce on shared or shared-nothing storage
• No waiting for data transfer between storage systems; results can be shared immediately
• A single data lake for all applications
• Enterprise data management
• Archive and analysis in place
[Figure: Traditional analytics solution (data ingested from the existing system into a separate analytics system, results exported back) versus in-place analytics (the existing system and the analytics system both work on one Spectrum Scale file system through file, object and HDFS Transparency interfaces).]
Spectrum Scale Compression
• Transparent compression for the HDFS Transparency, Object, NFS, SMB and POSIX interfaces
• Improved storage efficiency
• Typically 2x improvement in storage efficiency
• Improved I/O bandwidth
• Read/write compressed data reduces load on storage
• Improved client side caching
• Caching compressed data increases apparent cache size
• Per file compression
• Use policies
• Compress cold data
– Data not being used/accessed
Spectrum Scale Encryption
• Native Encryption of data at rest
• Files are encrypted before they are stored on disk
• Keys are never written to disk
• No data leakage in case disks are stolen or improperly decommissioned
• Secure deletion
• Ability to destroy arbitrarily large subsets of a file system
• No ‘digital shredding’, no overwriting: secure deletion is a cryptographic operation
• Use a Spectrum Scale policy to encrypt (or exclude) files in a fileset or file system
• Generally < 5% performance impact
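A minimal sketch of the "secure deletion is a cryptographic operation" idea, using the third-party Python cryptography package rather than Spectrum Scale's native mechanism (which keeps keys in an external key server, never on disk): once the key is destroyed, every file encrypted under it becomes unreadable, with no need to overwrite the data itself.

# Sketch of encrypt-at-rest plus cryptographic deletion with the `cryptography` package.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in Spectrum Scale the master key lives in a key server
cipher = Fernet(key)

# Write only ciphertext to disk.
with open("result.dat.enc", "wb") as f:
    f.write(cipher.encrypt(b"sensitive simulation output"))

# Normal read path: decrypt with the key.
with open("result.dat.enc", "rb") as f:
    assert cipher.decrypt(f.read()) == b"sensitive simulation output"

# "Secure deletion": destroy the key and every file encrypted under it is
# unreadable, however large the data set is.
del key, cipher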
Performance Feature: Spectrum Scale Local Read-Only Cache (LROC)
Benefits
• Expands the local node file cache (pagepool)
• Leverages fast local storage
• Can reduce load on central storage
• Transparent to applications
• Can use inexpensive local devices
Where to use it
• Protocol nodes
• Virtual machine storage
• Large-memory analytics
Easy to enable
• NSD type localCache
• Define only this node as NSD server
[Figure: Application nodes with local LROC devices in front of the shared storage.]
Performance Feature: Spectrum Scale Highly Available Write Cache (HAWC)
Benefits
• Speeds up small writes
• Used by the IBM Elastic Storage Server
Where to use it
• Workloads such as logs that issue small writes
• Any storage architecture: shared disk or shared nothing (use replication)
• I/O sizes up to 64 KiB
Easy to enable
• Create a system.log pool
• Enable the write cache on the file system
[Figure: Application nodes with flash-backed local storage in front of the shared storage.]
Spectrum Scale Multicluster: cross-cluster sharing
• Cross-mounting file systems between Spectrum Scale clusters
• Separate clusters = separate administration domains
• When connection is established, all nodes are interconnected
– All nodes in both clusters must be within same IP network segment / VLAN
– Channel can be encrypted (openssl)
Synchronous Replication & Stretched Cluster
• Replication is performed synchronously by the node that writes to disk
• Synchronous replication happens within a Spectrum Scale cluster
• The I/O does not return to the application until both copies are written
• Active/Active data access; reads are served from the fastest source
• DR with automatic failover and seamless file-system recovery
• If the replication spans two sites, this is a Spectrum Scale Stretched Cluster
[Figure: An application write synchronously replicated to both sites; reads go to whichever copy is fastest.]
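A conceptual sketch of the write path: the writing node does not return the I/O to the application until both replicas are durable, while reads may be served from whichever copy answers first. The replica paths are hypothetical stand-ins for the two failure groups.

# Toy model of synchronous replication semantics.
import os

REPLICAS = ("/site_a/fs/data.blk", "/site_b/fs/data.blk")   # hypothetical paths

def replicated_write(data: bytes) -> None:
    """Return to the caller only after every copy is written and flushed."""
    for path in REPLICAS:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # both copies durable before we acknowledge

def replicated_read() -> bytes:
    """Reads may be served from whichever replica answers; here, the first reachable one."""
    for path in REPLICAS:
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError:
            continue
    raise IOError("no replica available")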
Spectrum Scale Active File Management (AFM)
• An asynchronous, cross-cluster, data-sharing utility
• Functions well over unreliable and high-latency networks
• Extends the global namespace between multiple WAN-dispersed locations to share and exchange data asynchronously
• Caches local copies of data distributed to one or more clusters to improve local read and write performance
• As data is written or modified at one location, all other locations see that same data
Spectrum Scale AFM Main Concepts
• Home - Where the information lives. Owner of the data in a cache relationship
• Cache - Fileset in a remote cluster that points to home
• The relationship between a Cache and Home is one to one
• Cache knows about its Home. Home does not know a cache exists
• Data is copied to the cache when requested; data written at the cache is copied back to home as fast as possible
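A toy model of the cache/home relationship, with illustrative directories standing in for the two filesets: reads are satisfied from the cache and fetch from home on a miss, while writes complete locally and are replayed back to home asynchronously, so home never needs to know the cache exists.

# Toy read-through / write-back model of an AFM cache fileset.
import os, queue, shutil, threading

HOME, CACHE = "/home_site/fs", "/cache_site/fs"   # hypothetical mount points
pending = queue.Queue()                            # writes queued for home

def cached_read(name: str) -> bytes:
    local = os.path.join(CACHE, name)
    if not os.path.exists(local):                  # cache miss: pull from home
        shutil.copy(os.path.join(HOME, name), local)
    with open(local, "rb") as f:
        return f.read()

def cached_write(name: str, data: bytes) -> None:
    with open(os.path.join(CACHE, name), "wb") as f:
        f.write(data)                              # the write completes locally...
    pending.put(name)                              # ...and is replayed to home later

def replay_to_home() -> None:                      # background "gateway" loop
    while True:
        name = pending.get()
        shutil.copy(os.path.join(CACHE, name), os.path.join(HOME, name))
        pending.task_done()

threading.Thread(target=replay_to_home, daemon=True).start()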
IBM Elastic Storage Server (ESS) is a Software-Defined Solution
Migrate RAID and disk management to commodity file servers!
[Figure: Before - Spectrum Scale servers in front of custom dedicated disk controllers and JBOD disk enclosures. After - Spectrum Scale RAID running on commodity file servers (connected to clients over FDR InfiniBand or 10/40 GbE) that drive the JBOD disk enclosures directly, handling RAID and disk management in software.]
Spectrum Scale Native RAID is a software implementation of storage RAID technologies within Spectrum Scale.
It requires special licensing and is only approved for pre-certified architectures such as Lenovo GSS and IBM ESS (Elastic Storage Server).
Advantages of Spectrum Scale RAID
• Uses standard, inexpensive disk drives; the erasure code is implemented in Spectrum Scale software
• Data is declustered and distributed across all disk drives with the selected RAID protection: 3-way, 4-way, RAID6 8+2P, RAID6 8+3P
• Faster rebuild times: because data is declustered, more disks take part in a rebuild (approx. 3.5 times faster than RAID-5)
• Minimal impact of rebuilds on system performance: the rebuild is done by many disks, and rebuilds can be deferred when there is sufficient protection
• Better fault tolerance: end-to-end checksums and a much higher mean time to data loss (MTTDL)
RAID algorithm
• Two types of RAID: 3- or 4-way replication, and 8 + 2 or 8 + 3 parity
• 2-fault and 3-fault tolerant codes (RAID-D2, RAID-D3)
[Figure: One strip (GPFS block) plus 2 or 3 replicated strips for 3-way (1+2) and 4-way (1+3) replication; 8 strips (one GPFS block) plus 2 or 3 redundancy strips for 8+2p and 8+3p Reed-Solomon, giving the 2-fault and 3-fault tolerant codes.]
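A toy illustration of the strip layout: one GPFS block is cut into 8 data strips and protected by redundancy strips. For brevity the sketch computes a single XOR parity strip and rebuilds one lost strip from it; the real 8+2p and 8+3p codes are Reed-Solomon codes that tolerate 2 or 3 failures.

# Toy strip layout: split a block into 8 data strips, add one XOR parity strip.
from functools import reduce

def split_into_strips(block: bytes, n_data: int = 8):
    """Cut one file-system block into 8 equally sized data strips."""
    strip_len = len(block) // n_data
    return [block[i * strip_len:(i + 1) * strip_len] for i in range(n_data)]

def xor_parity(strips):
    """One redundancy strip: byte-wise XOR of the data strips (RAID-5 style)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), strips)

def rebuild_missing(strips, parity: bytes) -> bytes:
    """Recover a single lost strip by XOR-ing the parity with the survivors."""
    survivors = [s for s in strips if s is not None]
    return xor_parity(survivors + [parity])

block = bytes(range(256)) * 32                 # pretend 8 KiB "GPFS block"
strips = split_into_strips(block)
p = xor_parity(strips)
strips[3] = None                               # lose one strip (one disk)
assert rebuild_missing(strips, p) == block[3 * 1024:4 * 1024]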
Rebuild overhead reduction example
Declustered RAID6 example: critical rebuild performance on a GL6 with 8+2p.
During the critical rebuild the impact on the workload was high, but as soon as the array was back to single-parity protection the impact on the customer's workload was under 2%. The critical rebuild took 6 minutes.
The Data Integrity Manager prioritizes tasks: rebuild, rebalance, data scrubbing and proactive correction.
End-to-end checksum
• True end-to-end checksum, from the disk surface to the client's Spectrum Scale interface
• Repairs soft/latent read errors and lost/missing writes
• Checksums are maintained on disk and in memory and are transmitted to/from the client
• The checksum is stored in a 64-byte trailer of each 32-KiB buffer: an 8-byte checksum plus 56 bytes of ID and version information
• A sequence number is used to detect lost or missing writes
[Figure: 8 data strips and up to 3 parity strips, each carried in 32-KiB buffers with a 64-byte trailer and a 1/4- to 2-KiB terminus.]
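A sketch of the buffer-trailer idea: each 32-KiB buffer carries a 64-byte trailer holding a checksum plus ID, version and sequence fields, so the receiver can detect both corruption and lost or out-of-order writes. The field layout and CRC used here are invented for illustration and are not the on-disk GNR format.

# Toy 32-KiB buffer with a 64-byte trailer carrying checksum, block ID and sequence.
import struct, zlib

BUF = 32 * 1024
TRAILER = 64

def seal(data: bytes, block_id: int, seq: int) -> bytes:
    assert len(data) == BUF
    csum = zlib.crc32(data)                       # stand-in for the 8-byte checksum field
    trailer = struct.pack("<QQQ", csum, block_id, seq).ljust(TRAILER, b"\0")
    return data + trailer

def verify(buf: bytes, expected_seq: int) -> bytes:
    data, trailer = buf[:BUF], buf[BUF:]
    csum, block_id, seq = struct.unpack("<QQQ", trailer[:24])
    if csum != zlib.crc32(data):
        raise ValueError("bit rot or torn write detected")
    if seq != expected_seq:
        raise ValueError("lost or missing write detected")
    return data

payload = seal(b"\xab" * BUF, block_id=7, seq=42)
assert verify(payload, expected_seq=42) == b"\xab" * BUF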
IBM Elastic Storage Server family
• GS models use 2.5" JBODs or SSDs. Supported drives: 1.2 TB and 1.8 TB SAS; 400 GB, 800 GB and 1.6 TB SSD.
• GL models use 3.5" JBODs. Supported drives: 4 TB, 6 TB and 8 TB NL-SAS 3.5" HDDs.
• Supported NICs: 10GbE and 40GbE Ethernet, FDR or EDR InfiniBand.
[Figure: GS and GL building blocks - FC 5887 disk enclosures with 24 drive slots each.]
GS models (SSD / SAS)
• Model GS1: 24 SSD; 6 GB/s. Net capacity: 400 GB drives = 6 TB, 800 GB = 13 TB, 1.6 TB = 26 TB.
• Model GS2: 46 SAS + 2 SSD, or 48 SSD drives; 2 GB/s (SAS) / 12 GB/s (SSD). Net capacity: 400 GB = 13 TB, 800 GB = 26 TB, 1.6 TB = 53 TB (SSD); 1.2 TB = 35 TB, 1.6 TB = 53 TB (SAS).
• Model GS4: 94 SAS + 2 SSD, or 96 SSD drives; 5 GB/s (SAS) / 16 GB/s (SSD). Net capacity: 400 GB = 28 TB, 800 GB = 57 TB, 1.6 TB = 115 TB (SSD); 1.2 TB = 78 TB, 1.6 TB = 117 TB (SAS).
• Model GS6: 142 SAS + 2 SSD; 7 GB/s. Net capacity: 1.2 TB = 121 TB, 1.6 TB = 182 TB.
GL models (NL-SAS)
• Model GL2 (Analytics Focused): 2 enclosures, 12U; 116 NL-SAS + 2 SSD; 5 to 8 GB/s. Net capacity: 4 TB = 327 TB, 6 TB = 491 TB, 8 TB = 655 TB.
• Model GL4 (Analytics and Cloud): 4 enclosures, 20U; 232 NL-SAS + 2 SSD; 10 to 16 GB/s. Net capacity: 4 TB = 673 TB, 6 TB = 1 PB, 8 TB = 1.3 PB.
• Model GL6 (PetaScale Storage): 6 enclosures, 28U; 348 NL-SAS + 2 SSD; 10 to 25 GB/s. Net capacity: 4 TB = 1 PB, 6 TB = 1.5 PB, 8 TB = 2 PB.
ESS New Models: Performance and Capacity
The new models are built from 5U84 storage enclosures:
• New! Model GL2S: 2 enclosures, 14U; 166 NL-SAS + 2 SSD; max 1.6 PB raw. Net capacity: 4 TB = 508 TB, 8 TB = 1 PB, 10 TB = 1.27 PB.
• New! Model GL4S: 4 enclosures, 24U; 334 NL-SAS + 2 SSD; max 3.3 PB raw. Net capacity: 4 TB = 1 PB, 8 TB = 2 PB, 10 TB = 2.5 PB.
• New! Model GL6S: 6 enclosures, 34U; 502 NL-SAS + 2 SSD; max 5 PB raw. Net capacity: 4 TB = 1.5 PB, 8 TB = 3.1 PB, 10 TB = 3.9 PB.
For comparison, the existing models:
• Model GL2: 2 enclosures, 12U; 116 NL-SAS + 2 SSD; max 0.9 PB raw.
• Model GL4: 4 enclosures, 20U; 232 NL-SAS + 2 SSD; max 1.8 PB raw.
• Model GL6: 6 enclosures, 28U; 348 NL-SAS + 2 SSD; max 2.8 PB raw.
[Figure: Sequential throughput vs. capacity across the six models (chart values of 8, 11, 17, 23, 25 and 34 GB/s), from GL2 at the low end up to GL6S at roughly 34 GB/s.]
Software Defined Compute: IBM Platform Computing
Delivering a highly utilized shared-services environment optimized for time to results.
Application examples: simulation, analysis, design, big data.
Traditional (IT constrained, repeated for many apps and groups):
• Long wait times
• Low utilization
• IT sprawl
Software defined, with IBM Platform Computing making lots of computers look like "one" and prioritizing the matching of supply with demand across clusters, grid and cloud, for big data / Hadoop, simulation & modeling, analytics and long-running services:
• High utilization, throughput and performance
• Prioritization
• Reduced cost
• Faster results with fewer resources
Overall Artificial Intelligence (AI) Space
• Artificial Intelligence: human intelligence exhibited by machines
• Machine Learning / Deep Learning (cognitive): "human trained" using large amounts of data and the ability to learn how to perform the task; deep learning systems break tasks down using artificial neural networks
New data sources (NoSQL, Hadoop & analytics) enable a new class of applications: machine learning and training, pattern matching, image recognition, real-time decision support, complex workflows, and data lakes. They extend enterprise applications (finance: fraud detection and prevention; retail: shopping advisors; healthcare: diagnostics and treatment; supply chain and logistics) and extend predictive analytics to advanced analytics with AI, growing across compute, middleware, and storage.
PowerAI Platform
• Deep learning frameworks: Caffe, NVCaffe, IBMCaffe, Torch, DL4J, TensorFlow, Theano, OpenBLAS
• Supporting libraries: Bazel, DIGITS, NCCL
• Distributed frameworks: coming soon
• Accelerated servers and infrastructure for scaling: cluster of NVLink servers, Spectrum Scale high-speed parallel file system, scale to cloud
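As a quick sanity check on a PowerAI node, one of the listed frameworks can be asked whether it actually sees the NVLink-attached GPUs; the snippet below assumes a TensorFlow 2.x install (the TensorFlow 1.x releases PowerAI originally shipped with would use tf.test.is_gpu_available() instead).

# List the GPUs visible to TensorFlow on the node.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__} sees {len(gpus)} GPU(s)")
for gpu in gpus:
    print(" ", gpu.name)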
Where to start? IBM Power System S822LC: the Deep Learning server
• 20 POWER8 cores with NVLink
• Up to 1 TB of DDR4 memory
• NVIDIA Tesla P100 GPUs with NVLink (up to 4)
Target workloads and reference users:
• Parallel computing (e.g. Universidad Carlos III, Barcelona Supercomputing Center)
• GPU development and optimisation (e.g. molecular dynamics, Centro de Biología Molecular)
• Machine Learning / Deep Learning
Entry configuration: 20-core POWER8 + 256 GB + 1 NVIDIA Volta GPU, starting at €27,500 + VAT.
Questions?