Flash Memory Summit 2017, Santa Clara, CA
OpenCAPI™ Overview
Open Coherent Accelerator Processor Interface
Accelerated Computing and High Performance Bus
Attributes driving Accelerators
• Emergence of complex storage and memory solutions
• Introduction of device coherency requirements (first introduced by IBM in 2013)
• Growing demand for network performance
• Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
Driving factors for a high performance bus: consider the environment
• Increased industry dependence on hardware acceleration for performance
• Hyperscale datacenters and HPC are driving need for much higher network bandwidth
• Deep learning and HPC require more bandwidth between accelerators and memory
• New memory/storage technologies are increasing the need for bandwidth with low latency
(Diagram: Computation vs. Data Access)
Two Bus Challenges
1. High performance coherent bus needed
• Hardware acceleration will become commonplace, but…
• If you are going to use Advanced Memory/Storage technology and Accelerators, you need to get data in/out very quickly
• Today’s system interfaces are insufficient to address this requirement
• Systems must be able to integrate multiple memory technologies with different access methods, coherency and performance attributes
• Traditional I/O architecture results in very high CPU overhead when applications communicate with I/O or Accelerator devices
2. These challenges must be addressed in an open architecture allowing full industry participation
• Architecture agnostic to enable ecosystem growth and adoption
• Establish sufficient volume base to drive cost down
• Support broad ecosystem of software and attached devices
OpenCAPI Advantages for Storage Class Memories
• Open standard interface enables attachment of a wide range of devices
• Ability to support a wide range of access models from byte addressable load/store to block
• Extreme bandwidth beyond classical storage interfaces
• OpenCAPI feature of Home Agent Memory geared specifically for storage class memory paradigms
• Agnostic interface allows extension to evolving memory technologies in the future (e.g., compute-in-memory)
• Common physical interface between non-memory and memory devices
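The range of access models above, from byte-addressable load/store to block, can be illustrated with a minimal C sketch. This is a software analogy only: the `scm_load_byte` and `scm_read_block` names and the simulated SCM buffer are hypothetical, not part of the OpenCAPI specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Byte-addressable load/store model: the (simulated) SCM region sits
 * in the load/store domain, so a read touches only the byte needed. */
static uint8_t scm_load_byte(const uint8_t *scm, size_t offset) {
    return scm[offset];
}

/* Block access model: data is staged into a DRAM buffer a whole
 * BLOCK_SIZE unit at a time, as with a classical storage interface. */
static void scm_read_block(const uint8_t *scm, size_t block_no,
                           uint8_t *dram_buf) {
    memcpy(dram_buf, scm + block_no * BLOCK_SIZE, BLOCK_SIZE);
}
```

The contrast is the point: a byte-addressable device lets the application touch one datum in place, while a block device forces a whole-block copy even when only a few bytes are wanted.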
Where are we coming from today? CAPI Technology Unlocks the Next Level of Performance for Flash
Identical hardware (IBM POWER S822L with FlashSystem) with 3 different paths to data:
• Conventional I/O (FC)
• Legacy CAPI 1.0 – External Flash Drawer
• Legacy CAPI 1.0 – Integrated Card
IBM's Legacy CAPI 1.0 NVMe Flash Accelerator is almost 5X more efficient in performing IO vs. traditional storage.

Relative CAPI vs. NVMe instruction counts per IO (kernel + user instructions):
• CAPI NVMe: 21%
• Traditional NVMe: 35%
• Traditional Storage, Direct IO: 56%
• Traditional Storage, Filesystem: 100%

Legacy CAPI 1.0 accelerated NVMe Flash can issue 3.7X more IOs per CPU thread than regular NVMe flash.
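As a quick sanity check, the "almost 5X" claim follows directly from the chart data: 100% / 21% ≈ 4.8. A trivial C helper (the `efficiency_gain` name is illustrative, not from the deck) makes the arithmetic explicit:

```c
/* Ratio of baseline instruction count per IO to the CAPI count,
 * both expressed as percentages of the filesystem baseline. */
static double efficiency_gain(double baseline_pct, double capi_pct) {
    return baseline_pct / capi_pct;  /* 100.0 / 21.0 ~= 4.76 */
}
```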
• Improves scaling and resiliency
• Caching with persistent data frames
• New solutions via large scaling
Comparison of Memory Paradigms
• Main Memory (example: basic DDR attach): Processor Chip -> DLx/TLx -> DDR4/5 data
• Emerging Storage Class Memory: Processor Chip -> DLx/TLx -> SCM data
• Tiered Memory: Processor Chip -> DLx/TLx -> DDR4/5 data plus SCM
OpenCAPI wins due to bandwidth, best-of-breed latency, and the flexibility of an open architecture.
JOIN TODAY! www.opencapi.org
Acceleration Paradigms with Great Performance
• Basic work offload: Processor Chip -> DLx/TLx -> Acc
• Egress Transform (Processor Chip -> Acc -> Data): Examples: Encryption, Compression, Erasure prior to network or storage
• Ingress Transform (Data -> Acc -> Processor Chip): Examples: Video Analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), Video Encoding (H.265), etc.
• Bi-Directional Transform (Processor Chip <-> Acc <-> Data): Examples: NoSQL such as Neo4j with Graph Node Traversals, etc.
• Memory Transform (Processor Chip -> Acc -> Data): Examples: Machine or Deep Learning, potentially using OpenCAPI attached memory
• Needle-in-a-Haystack Engine (Acc scans the haystack, returns only the needles): Examples: Database searches, joins, intersections, merges
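The needle-in-a-haystack paradigm can be sketched as a software stand-in in C: in the real paradigm this scan runs inside the OpenCAPI-attached accelerator, so only the needles, not the whole haystack, cross the processor bus. The `haystack_scan` name and int32 payload are hypothetical, chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for a needle-in-a-haystack engine: scan the
 * haystack where it lives and return only the indices of matches,
 * so only "needles" travel back to the processor. */
static size_t haystack_scan(const int32_t *haystack, size_t n,
                            int32_t needle, size_t *match_idx) {
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
        if (haystack[i] == needle)
            match_idx[found++] = i;
    return found;
}
```

Database searches, joins, intersections, and merges all fit this shape: the filter predicate moves to the data, and only qualifying rows move to the CPU.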
Data Centric Computing with OpenCAPI™
Allan Cantle, CTO & Founder, Nallatech
Nallatech at a Glance
Nallatech – a Molex company
• 24 years of FPGA computing heritage
• Server-qualified accelerator cards featuring FPGAs, network I/O and an open-architecture software/firmware framework
• Design services / application optimisation
• Data-centric high performance heterogeneous computing
• Real-time, low-latency network and I/O processing
• Intel PSG (Altera) OpenCL & Xilinx Alliance partner
• Member of OpenCAPI, GenZ & OpenPOWER
• Server partners: Cray, DELL, HPE, IBM, Lenovo
• Application porting & optimization services
• Successfully deployed high volumes of FPGA accelerators
Data Centric Architectures - Fundamental Principles
1. Consume Zero Power when Data is Idle
2. Don’t Move the Data unless you absolutely have to
3. When Data has to Move, Move it as efficiently as possible
Our guiding light: the value is in the data, and the CPU core can often be effectively free!
Data Center Architectures: Blending Evolutionary with Revolutionary
• Existing datacenter infrastructure: CPUs with directly attached memory
• Emerging data-centric enhancements: FPGAs with SCM/Flash, attached to the CPUs over OpenCAPI
Nallatech HyperConverged & Disaggregatable Server
• Leverages Google & Rackspace's OCP Zaius/Barreleye G2 platform
• Reconfigurable FPGA fabric with balanced bandwidth to CPU, storage & data-plane network
• OpenCAPI provides a low-latency, coherent accelerator/processor interface
• GenZ memory-semantic fabric provides addressable shared memory up to 32 zettabytes
• 4x OpenCAPI channels: 200 GBytes/s (diagram link bandwidths: 200 GBytes/s and 170 GB/s)
Xilinx Zynq US+ 0.5OU High Storage Accelerator Blade
4 FSAs in a 2OU Rackspace Barreleye G2 OCP storage drawer deliver:
• 152 GByte/s PFD* bandwidth to 1TB of DDR4 memory
• 256 GByte/s PFD* bandwidth to 64TB of Flash
• 200 GByte/s PFD* bandwidth through the OpenCAPI channels
• 200 GByte/s PFD* bandwidth through the GenZ fabric IO
• Open architecture software/firmware framework
• Reconfigurable hardware dataplane: Flash Storage Accelerator (FSA)
FSA block diagram (major components):
• Xilinx Zynq US+ ZU19EG FFVC1760 MPSoC with 8 GByte DDR4 and a PCIe G2 x4 control-plane interface
• 2x 128 GByte DDR4 RDIMM @ 2400 MTPS (x72 channels)
• PCIe Gen 3 switch fanning out 8x PCIe x4 G3 links to 8x M.2 22110 SSDs
• OpenCAPI interface: PCIe x16 G3, SlimSAS connector
• GenZ data-plane I/O: 2x 100GbE QSFP28 ports
*PFD = Peak Full Duplex
Summary
OpenCAPI Accelerator to Processor Interface Benefits:
• Coherency
• Lowest Latency
• Highest Bandwidth
• Open Standard
• Perfect Bridge to blend CPU Centric & Data Centric Architectures
Join the Open Community, where independent experts innovate together and you can help decide on big topics such as whether separate control and data planes are better than converged ones.