Flash Memory Summit 2017, Santa Clara, CA
OpenCAPI™ Overview
Open Coherent Accelerator Processor Interface
Accelerated Computing and High Performance Bus
Attributes driving Accelerators
• Emergence of complex storage and memory solutions
• Introduction of device coherency requirements (first introduced by IBM in 2013)
• Growing demand for network performance
• Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
Driving factors for a high performance bus: consider the environment
• Increased industry dependence on hardware acceleration for performance
• Hyperscale datacenters and HPC are driving need for much higher network bandwidth
• Deep learning and HPC require more bandwidth between accelerators and memory
• New memory/storage technologies are increasing the need for bandwidth with low latency
(Diagram: Computation vs. Data Access)
Two Bus Challenges
1. High performance coherent bus needed
• Hardware acceleration will become commonplace, but…
• If you are going to use Advanced Memory/Storage technology and Accelerators, you need to get data in/out very quickly
• Today’s system interfaces are insufficient to address this requirement
• Systems must be able to integrate multiple memory technologies with different access methods, coherency and performance attributes
• Traditional I/O architecture results in very high CPU overhead when applications communicate with I/O or Accelerator devices
2. These challenges must be addressed in an open architecture allowing full industry participation
• Architecture agnostic to enable ecosystem growth and adoption
• Establish sufficient volume base to drive cost down
• Support broad ecosystem of software and attached devices
OpenCAPI Advantages for Storage Class Memories
• Open standard interface enables attachment of a wide range of devices
• Ability to support a wide range of access models from byte addressable load/store to block
• Extreme bandwidth beyond classical storage interfaces
• OpenCAPI feature of Home Agent Memory geared specifically for storage class memory paradigms
• Agnostic interface allows extension to evolving memory technologies in the future (e.g., compute-in-memory)
• Common physical interface between non-memory and memory devices
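The range of access models above, from byte-addressable load/store to block, can be illustrated with a minimal C sketch. This is a software analogy only: the `scm_load_byte` and `scm_read_block` names and the simulated SCM buffer are hypothetical, not part of the OpenCAPI specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Byte-addressable load/store model: the (simulated) SCM region sits
 * in the load/store domain, so a read touches only the byte needed. */
static uint8_t scm_load_byte(const uint8_t *scm, size_t offset) {
    return scm[offset];
}

/* Block access model: data is staged into a DRAM buffer a whole
 * BLOCK_SIZE unit at a time, as with a classical storage interface. */
static void scm_read_block(const uint8_t *scm, size_t block_no,
                           uint8_t *dram_buf) {
    memcpy(dram_buf, scm + block_no * BLOCK_SIZE, BLOCK_SIZE);
}
```

The contrast is the point: a byte-addressable device lets the application touch one datum in place, while a block device forces a whole-block copy even when only a few bytes are wanted.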
Where are we coming from today? CAPI Technology Unlocks the Next Level of Performance for Flash
Identical hardware (IBM POWER S822L with FlashSystem) with 3 different paths to data:
• Conventional I/O (FC)
• Legacy CAPI 1.0 – External Flash Drawer
• Legacy CAPI 1.0 – Integrated Card
IBM's Legacy CAPI 1.0 NVMe Flash Accelerator is almost 5X more efficient in performing IO vs. traditional storage.

Relative CAPI vs. NVMe instruction counts per IO (kernel + user instructions):
• CAPI NVMe: 21%
• Traditional NVMe: 35%
• Traditional Storage, Direct IO: 56%
• Traditional Storage, Filesystem: 100%

Legacy CAPI 1.0 accelerated NVMe Flash can issue 3.7X more IOs per CPU thread than regular NVMe flash.
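As a quick sanity check, the "almost 5X" claim follows directly from the chart data: 100% / 21% ≈ 4.8. A trivial C helper (the `efficiency_gain` name is illustrative, not from the deck) makes the arithmetic explicit:

```c
/* Ratio of baseline instruction count per IO to the CAPI count,
 * both expressed as percentages of the filesystem baseline. */
static double efficiency_gain(double baseline_pct, double capi_pct) {
    return baseline_pct / capi_pct;  /* 100.0 / 21.0 ~= 4.76 */
}
```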
• Improves scaling and resiliency
• Caching with persistent data frames
• New solutions via large scaling
Comparison of Memory Paradigms
• Main Memory (example: basic DDR attach): Processor Chip -> DLx/TLx -> DDR4/5 data
• Emerging Storage Class Memory: Processor Chip -> DLx/TLx -> SCM data
• Tiered Memory: Processor Chip -> DLx/TLx -> DDR4/5 data plus SCM
OpenCAPI wins due to bandwidth, best-of-breed latency, and the flexibility of an open architecture.
JOIN TODAY! www.opencapi.org
Acceleration Paradigms with Great Performance
• Basic work offload: Processor Chip -> DLx/TLx -> Acc
• Egress Transform (Processor Chip -> Acc -> Data): Examples: Encryption, Compression, Erasure prior to network or storage
• Ingress Transform (Data -> Acc -> Processor Chip): Examples: Video Analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), Video Encoding (H.265), etc.
• Bi-Directional Transform (Processor Chip <-> Acc <-> Data): Examples: NoSQL such as Neo4j with Graph Node Traversals, etc.
• Memory Transform (Processor Chip -> Acc -> Data): Examples: Machine or Deep Learning, potentially using OpenCAPI attached memory
• Needle-in-a-Haystack Engine (Acc scans the haystack, returns only the needles): Examples: Database searches, joins, intersections, merges
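The needle-in-a-haystack paradigm can be sketched as a software stand-in in C: in the real paradigm this scan runs inside the OpenCAPI-attached accelerator, so only the needles, not the whole haystack, cross the processor bus. The `haystack_scan` name and int32 payload are hypothetical, chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for a needle-in-a-haystack engine: scan the
 * haystack where it lives and return only the indices of matches,
 * so only "needles" travel back to the processor. */
static size_t haystack_scan(const int32_t *haystack, size_t n,
                            int32_t needle, size_t *match_idx) {
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
        if (haystack[i] == needle)
            match_idx[found++] = i;
    return found;
}
```

Database searches, joins, intersections, and merges all fit this shape: the filter predicate moves to the data, and only qualifying rows move to the CPU.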
Data Centric Computing with OpenCAPI™
Allan Cantle, CTO & Founder, Nallatech
Nallatech at a Glance
Nallatech – a Molex company
• 24 years of FPGA computing heritage
• Server-qualified accelerator cards featuring FPGAs, network I/O and an open-architecture software/firmware framework
• Design services / application optimisation
• Data-centric high performance heterogeneous computing
• Real-time, low-latency network and I/O processing
• Intel PSG (Altera) OpenCL & Xilinx Alliance partner
• Member of OpenCAPI, GenZ & OpenPOWER
• Server partners: Cray, DELL, HPE, IBM, Lenovo
• Application porting & optimization services
• Successfully deployed high volumes of FPGA accelerators
Data Centric Architectures - Fundamental Principles
1. Consume Zero Power when Data is Idle
2. Don’t Move the Data unless you absolutely have to
3. When Data has to Move, Move it as efficiently as possible
Our guiding light: the value is in the data, and the CPU core can often be effectively free!
Data Center Architectures: Blending Evolutionary with Revolutionary
• Existing datacenter infrastructure: CPUs with directly attached memory
• Emerging data-centric enhancements: FPGAs with SCM/Flash, attached to the CPUs over OpenCAPI
Nallatech HyperConverged & Disaggregatable Server
• Leverages Google & Rackspace's OCP Zaius/Barreleye G2 platform
• Reconfigurable FPGA fabric with balanced bandwidth to CPU, storage & data-plane network
• OpenCAPI provides a low-latency, coherent accelerator/processor interface
• GenZ memory-semantic fabric provides addressable shared memory up to 32 zettabytes
• 4x OpenCAPI channels: 200 GBytes/s (diagram link bandwidths: 200 GBytes/s and 170 GB/s)
Xilinx Zynq US+ 0.5OU High Storage Accelerator Blade
4 FSAs in a 2OU Rackspace Barreleye G2 OCP storage drawer deliver:
• 152 GByte/s PFD* bandwidth to 1TB of DDR4 memory
• 256 GByte/s PFD* bandwidth to 64TB of Flash
• 200 GByte/s PFD* bandwidth through the OpenCAPI channels
• 200 GByte/s PFD* bandwidth through the GenZ fabric IO
• Open architecture software/firmware framework
• Reconfigurable hardware dataplane: Flash Storage Accelerator (FSA)
FSA block diagram (major components):
• Xilinx Zynq US+ ZU19EG FFVC1760 MPSoC with 8 GByte DDR4 and a PCIe G2 x4 control-plane interface
• 2x 128 GByte DDR4 RDIMM @ 2400 MTPS (x72 channels)
• PCIe Gen 3 switch fanning out 8x PCIe x4 G3 links to 8x M.2 22110 SSDs
• OpenCAPI interface: PCIe x16 G3, SlimSAS connector
• GenZ data-plane I/O: 2x 100GbE QSFP28 ports
*PFD = Peak Full Duplex
Summary
OpenCAPI Accelerator to Processor Interface Benefits:
• Coherency
• Lowest Latency
• Highest Bandwidth
• Open Standard
• Perfect Bridge to blend CPU Centric & Data Centric Architectures
Join the Open Community, where independent experts innovate together and you can help decide on big topics such as whether separate control and data planes are better than converged ones.