Altera Technology Leadership with Hybrid Memory Cube Technology

Manish Deo
Product Marketing Manager, High End FPGA, Altera
Based in San Jose, California

Focusing on:
− Next generation memory solutions, 3D/2.5D technology

Background (Previous Roles):
− Product Engineering Manager
− FPGA Test Development, IP Verification, Timing Model Development
− Methodology Development, 2.5D Test & Verification

Presenter Information

Industry Trends

Hybrid Memory Cube Technology

Altera Stratix V – HMC: Interoperability platform

Performance Evaluation

Altera Generation 10 Portfolio

Agenda

Insatiable Need for Memory Bandwidth

– Global demand for mobility; Impact of cloud-based systems – Big data analytics challenge

[Chart: traffic in exabytes (axis 0 to 140), 2012 to 2017, broken down by category: Consumer - Online gaming; Business - File sharing; Consumer - File sharing; Business - Web & Other data; Consumer - Web & Other data; Business - Video; Consumer - Video. Courtesy: Cisco]


Network Processor Logic - Traffic Manager - Packet Processing - Switch Buffering

Blade

Rackmount Chassis

Networking Applications

PP Function                  Memories Used
Parsing                      M20K*
Packet Store                 M20K, DDR
Classification               TCAM
Packet Editing               M20K, QDR, RLD
Statistics                   M20K, DDR, RLD
Policing                     M20K, QDR, RLD
Forwarding                   DDR

TM Function                  Memories Used
Free List                    M20K, QDR, RLD
Linked List                  M20K, QDR, RLD
Queue & Buffer Management    QDR, DDR, RLD
nQ, dQ (head, tail ptrs)     QDR, RLD
Congestion Mgt.              QDR, RLD
Scheduler                    QDR, RLD

* M20K: Distributed embedded SRAM in Altera FPGA

Memory Intensive Applications

Wireline Application Memory Requirements

Data plane memory (DRAM)
− Temporary storage of packets while they await forwarding decision
− Requires high capacity and bandwidth

Control plane memory (SRAM)
− Storage of data for forwarding decision
− Requires low latency and high random transaction rate


System Implications

Aggregate bandwidth in and out of the package is not scaling with system throughput requirements
− Number of pins in a typical high-end FPGA package has been roughly flat
− IO & DDR memory data rate is not keeping up with application needs
− Fixed number of package pins for IO and transceivers

Number of components for system-level integration is going up to enable enhanced capabilities

Total system-level power budget has been relatively flat
− Customers expect 2X improvement in performance/watt


3D Integration can address system level challenges with bandwidth, power, and board space

Emerging High Performance Memory

Memory vendors are addressing IO bandwidth constraints by architecting the memory and IO interface

Both serial and wide IO solutions will likely co-exist


[Diagram: Legacy products: control & data plane memory (QDR, RLDRAM, DDR). 2.5D/3D-capable memory: serial IO interface; parallel (wide IO) interface.]

HMC : Broad Industry Adoption

HMCC Mission: Promote widespread adoption and acceptance of an industry-standard serial interface and protocol for Hybrid Memory Cube

Hybrid Memory Cube Technology

Next Gen 3D Based Multi-Bank DRAM Memory
− 128 independent banks employing state-of-the-art TSV technology
− DRAM layers stacked on base-logic layer
− Credit-based flow control, CRC protection, automatic retry for failed transfers
− Memory space divisible between links or as a single pool

Interface
− 64 transceiver-based serial channels (16 channels x 4 links)
− Capable of supporting data rates of 15 Gbps per channel
− Offers up to 1 Tbps aggregate interface bandwidth
− Advanced packet-based protocol
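The aggregate figure can be checked with simple arithmetic. The sketch below assumes the 15 Gbps rate applies to each of the 64 serial channels; the function name is illustrative and not part of any HMC tool.

```python
def aggregate_interface_gbps(links: int, channels_per_link: int,
                             gbps_per_channel: float) -> float:
    """Raw aggregate line rate across all serial channels, in Gbps."""
    return links * channels_per_link * gbps_per_channel


# Values from the slide: 4 links, 16 channels each.
# Assumption: 15 Gbps is the rate of each individual channel.
total = aggregate_interface_gbps(links=4, channels_per_link=16,
                                 gbps_per_channel=15.0)
print(f"{total:.0f} Gbps aggregate (about {total / 1000:.2f} Tbps)")
```

Under that assumption the raw line rate is 960 Gbps, which is consistent with the "up to 1 Tbps" figure quoted on the slide.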


400G System Design Example


Packet Buffer Requirements and Assumptions
− Four 100GbE ports per Network Processor / Traffic Manager
− Packet buffering on ingress or egress
− Maintain 800 Gbps effective bandwidth across all packet sizes at each packet buffer

* Courtesy Micron

400Gbps Packet Buffering: HMC Benefits


Area                         Benefit
Design simplification        One HMC device vs. 72 (48 x16, 24 x4 ECC) DDR4 DIMMs
Reduced active pins          4 HMC 15G SR links @ 12.5 Gbps vs. ~1900 DDR pins
Smaller memory footprint     Board area reduction
Abstract memory              Reduces system maintenance burden on host
Higher energy efficiency     Lower interface pJ/bit metric

Altera – Micron Technology Leadership with HMC

Altera – Micron Technology Leadership
− First interoperability announcement September 2013*
− Only FPGA to demonstrate interoperability at Super Computing 13*


Micron Booth

Altera Booth

Proven Technology with Stratix V Interoperability


15x more bandwidth, 70% less energy and 90% less board area

Hardware
− Altera: 5SGXMA3K1F35C1N FPGA
  − Four FPGAs
  − One "master" FPGA
− Micron: HMC device
  − Four separate x16 links
− Interop board
  − I2C config of HMC
  − Local power monitoring and control
  − Altera Byte Blaster II USB port

Configuration Set-Up
− Exercise 4 HMC links
− Link data rate of 10 Gbps (per lane)
− Full-width configuration (16 transceivers per link)
− 16B, 32B, 64B, 128B data packets (power-of-two payload scheme)

Controller Communication & Execution Scheme


Typically PC or laptop station
− Proprietary UDP packet format
− Communication to board using Ethernet (RGMII interface)
− Four Ethernet slots to communicate with the 4 FPGAs

Configuration bus between master and slave FPGAs
− Enables operation with only one Ethernet connection between host and board; the master FPGA controls the slave FPGAs

Execute continuous loop of requests
− Stored in FPGA memory (as FLITs)
− Define a "calculation window" to monitor FLIT traffic, counting the number of FLITs and the FLIT requests by data packet size
− Divide the number of data bits transmitted (determined by the request type) by the length of time the calculation window ran

Controller Architecture


Packet Generator
− Responsible for generating quad-FLIT packets (512 bits wide)
− 1K FLIT buffer using internal FPGA memory to store patterns
− Capable of 100% link saturation

Transmitter
− Updates header and tail of packets from Pattern Generator
− Manages functions such as token-based flow control, CRC insertion, etc.
− Includes retry buffer to accommodate link bit errors

Receiver
− Validates received packets
− Checks for correct framing, sequence numbers, and CRC
− Extracts flow control and retry pointers and feeds them to the Transmitter

Transceiver
− Full-width configuration (16 lanes per link)
− Sends/receives parallel 32-bit data over 16 lanes (512 bits wide)
− Implements link negotiation and training at startup
− Performs scramble/descramble of data
− Implements clock domain crossings between core clock and link clocks

Protocol Analyzer
− Raw TX and RX FLIT capture
− Allows detailed analysis of link traffic

Command Histogram
− Accumulates counts of all TX and RX commands
− Allows for analysis of link traffic patterns

Latency Histogram
− Allows latency measurement of 2 commands or groups of commands
− Commands and groups are programmable
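The token-based flow control handled by the Transmitter can be pictured as a simple credit counter. The sketch below is a generic illustration, assuming one token corresponds to one FLIT of free buffer space at the far end. The class and method names are hypothetical, and the actual HMC packet fields and the controller's RTL are not shown.

```python
class TokenFlowControl:
    """Generic credit counter: one token = one FLIT of far-end buffer space (assumed)."""

    def __init__(self, initial_tokens: int) -> None:
        self.tokens = initial_tokens

    def try_send(self, packet_flits: int) -> bool:
        """Consume tokens for a packet; refuse to send if the far end lacks room."""
        if packet_flits > self.tokens:
            return False
        self.tokens -= packet_flits
        return True

    def return_tokens(self, flits_freed: int) -> None:
        """Called when the Receiver extracts returned tokens from incoming packets."""
        self.tokens += flits_freed


fc = TokenFlowControl(initial_tokens=10)
first = fc.try_send(9)     # a 9-FLIT write request fits
second = fc.try_send(9)    # only 1 token left, so this one must wait
fc.return_tokens(9)        # far end drained the first packet
third = fc.try_send(9)     # now there is room again
print(first, second, third, fc.tokens)
```

The point of the scheme is that the sender never overruns the receiver's input buffer, which is why the Receiver block feeds extracted flow-control information back to the Transmitter.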

Theoretical Performance Number

Link speed: 10 Gbps, link width: 16 lanes
− Theoretical bandwidth = 160 Gb/s (20 GB/s) per direction
− For a 128-byte packet, 9 FLITs per packet (1 FLIT = 128 bits)
− 1 FLIT for header and tail, 8 FLITs for payload

Traffic Pattern Scenarios
− 100% Write, 100% Read, 50% Write/Read with non-posted write


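50% Write (non-posted) and 50% Read
− Each direction carries 10 FLITs (one 9-FLIT data packet plus one 1-FLIT request or response), of which 8 are payload
− Theoretical number: 8 / 10 * 20 = 16 GB/s
− Actual number in test: ~16 GB/s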
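The theoretical numbers above can be reproduced with a short calculation. This is a sketch under the slide's stated figures (10 Gbps lanes, 16 lanes, 128-bit FLITs, one FLIT of header and tail per packet). The single-type case at the end is an extrapolation for the data-carrying direction and is not a figure from the slides; all function names are illustrative.

```python
FLIT_BITS = 128        # 1 FLIT = 128 bits (from the slide)
OVERHEAD_FLITS = 1     # header and tail share one FLIT (from the slide)


def raw_link_gbytes_per_s(lane_gbps: float, lanes: int) -> float:
    """Raw one-direction link bandwidth in GB/s."""
    return lane_gbps * lanes / 8


def payload_flits(payload_bytes: int) -> int:
    """Number of FLITs needed to carry the data payload."""
    return payload_bytes * 8 // FLIT_BITS


def effective_gbytes_per_s(payload_bytes: int, extra_flits: int, raw: float) -> float:
    """Payload bandwidth in one direction.

    extra_flits counts the 1-FLIT packets (read requests or write
    responses) that share the direction with each data-carrying packet.
    """
    data = payload_flits(payload_bytes)
    on_wire = data + OVERHEAD_FLITS + extra_flits
    return data * raw / on_wire


raw = raw_link_gbytes_per_s(lane_gbps=10.0, lanes=16)
mixed = effective_gbytes_per_s(128, extra_flits=1, raw=raw)     # 50% write / 50% read
one_way = effective_gbytes_per_s(128, extra_flits=0, raw=raw)   # assumption: data direction only
print(f"raw {raw:.1f} GB/s, 50/50 mix {mixed:.1f} GB/s, single-type {one_way:.2f} GB/s")
```

For the 50/50 mix the result matches the slide's 8 / 10 * 20 = 16 GB/s per direction.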

[Diagram: FPGA to HMC: 128B Write (9 FLITs), Read (1 FLIT). HMC to FPGA: Wr Re* (1 FLIT), Read Back Data (9 FLITs).]

HMC Demo Example

GUI based demo developed − Ability to showcase key parameters real time across multiple user configuration settings

Maximum Bandwidth example shown below
− 128 B data payload, 50% access ratio (half TX and half RX), link rate 10 Gbps
− Total aggregate bandwidth across 4 links (TX + RX) ~128 GB/s, i.e. 4 links x 2 directions x ~16 GB/s (green line in aggregate window)

Configuration Controller Window

Aggregate Link Display Window

Link Display Window

Target Markets and Applications


Market                        Value
Wireline Communications       Pin-efficient bandwidth (400G); PCB density
High Performance Computing    PCB density; high bandwidth; reduced power
Military                      Radar: high BW; signaling intelligence (search): parallel search/matching algorithms
Test & Measurement            High bandwidth for data capture of ultra-high-speed analog waveforms

Enabling higher performance and lower power across many applications

Breakthrough Advantage with Generation 10 Devices

Delivering Unimaginable Performance
− Intel 14 nm Tri-Gate process
− 2x performance increase or 70% power savings
− 64-bit quad core ARM A53
− 3D-capable for integrating SRAM, DRAM, ASIC
− Up to 144 transceiver channels
− 32 Gbps chip-to-chip

Reinventing the Midrange
− TSMC 20 nm process
− 15% higher performance than current high-end with 40% lower midrange power
− Compatible 32-bit dual ARM A9 with 1.6x processor system improvement
− Up to 96 transceiver channels
− 28 Gbps chip-to-chip

Generation 10 FPGA and SoCs deliver Hybrid Memory Cube for volume production

Need more Information?


Altera HMC Landing Page (Video, White paper, roadmap) − http://www.altera.com/hmc

Micron HMC Landing Page

− http://www.micron.com

Joint Press Release

− http://newsroom.altera.com/press-releases/nr-altera-micron-hmc.htm

Press Coverage

− EE Times: http://www.eetimes.com/document.asp?doc_id=1319391
− EBN: http://www.ebnonline.com/author.asp?section_id=3382&doc_id=268265

HMCC Consortium

− http://www.hybridmemorycube.org/

ACE/EDN/EE Times Awards 2014

Altera/Micron HMC Interop Team − Winner: Design Team of the Year

Back Up


Altera Technology Leadership with Hybrid Memory Cube Technology · 2019-10-11
