Altera Technology Leadership with Hybrid Memory Cube Technology · 2019-10-11
Altera Technology Leadership with Hybrid Memory Cube Technology
Manish Deo
Product Marketing Manager, High End FPGA, Altera
Based in San Jose, California
Focusing on:
− Next generation memory solutions, 3D/2.5D technology
Background (Previous Roles):
− Product Engineering Manager
− FPGA Test Development, IP Verification, Timing Model Development
− Methodology Development, 2.5D Test & Verification
Presenter Information
Industry Trends
Hybrid Memory Cube Technology
Altera Stratix V – HMC: Interoperability platform
Performance Evaluation
Altera Generation 10 Portfolio
Agenda
Insatiable Need for Memory Bandwidth
− Global demand for mobility; impact of cloud-based systems
− Big data analytics challenge
[Chart: traffic in exabytes (axis 0 to 140) by year, 2012 to 2017. Categories: Consumer - Online gaming; Business - File sharing; Consumer - File sharing; Business - Web & Other data; Consumer - Web & Other data; Business - Video; Consumer - Video. Courtesy: Cisco]
Network Processor Logic - Traffic Manager - Packet Processing - Switch Buffering
Blade
Rackmount Chassis
Networking Applications
PP Function                  Memories Used
Parsing                      M20K*
Packet Store                 M20K, DDR
Classification               TCAM
Packet Editing               M20K, QDR, RLD
Statistics                   M20K, DDR, RLD
Policing                     M20K, QDR, RLD
Forwarding                   DDR

TM Function                  Memories Used
Free List                    M20K, QDR, RLD
Linked List                  M20K, QDR, RLD
Queue & Buffer Management    QDR, DDR, RLD
nQ, dQ (head, tail ptrs)     QDR, RLD
Congestion Mgt.              QDR, RLD
Scheduler                    QDR, RLD

* M20K: Distributed embedded SRAM in Altera FPGA
Memory Intensive Applications
Wireline Application Memory Requirements
Data plane memory
− Temporary storage of packets while they await forwarding decision (DRAM)
− Requires high capacity and bandwidth
Control plane memory
− Storage of data for forwarding decision
− Requires low latency and high random transaction rate (SRAM)
System Implications
Aggregate bandwidth in and out of the package is not scaling with system throughput requirements
− Number of pins in a typical high-end FPGA package has been roughly flat
− IO & DDR memory data rate is not keeping up with application needs
− Fixed number of package pins for IO and transceivers
Number of components for system-level integration is going up to enable enhanced capabilities
Total system-level power budget has been relatively flat
− Customers expect 2X improvement in performance/watt
3D Integration can address system level challenges with bandwidth, power, and board space
Emerging High Performance Memory
Memory vendors are addressing IO bandwidth constraints by architecting the memory and IO interface
Both serial and wide IO solutions will likely co-exist
Control & Data Plane Memory
(QDR, RLDRAM, DDR)
Serial IO Interface
Parallel (Wide IO) Interface
Legacy Products
2.5D/3D Capable Memory
HMC : Broad Industry Adoption
HMCC Mission: Promote widespread adoption and acceptance of an industry standard serial interface and protocol for Hybrid Memory Cube
Hybrid Memory Cube Technology
Next Gen 3D Based Multi-Bank DRAM Memory
− 128 independent banks employing state-of-the-art TSV technology
− DRAM layers stacked on base-logic layer
− Credit-based flow control, CRC protection, automatic retry for failed transfers
− Memory space divisible between links or as a single pool
Interface
− 64 transceiver-based serial channels (16 channels x 4 links)
− Capable of supporting data rates of 15Gbps per channel
− Offers up to 1Tbps aggregate interface bandwidth
− Advanced packet-based protocol
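The "up to 1Tbps" figure can be checked against the channel count with a few lines of arithmetic. This is a minimal sketch that assumes the 15Gbps rate applies to each of the 64 serial channels; the constant names are illustrative and not taken from any HMC tool.

```python
# Interface figures quoted on the slide.
LINKS = 4
CHANNELS_PER_LINK = 16
GBPS_PER_CHANNEL = 15  # assumption: the 15 Gbps rate is per serial channel

total_channels = LINKS * CHANNELS_PER_LINK
aggregate_gbps = total_channels * GBPS_PER_CHANNEL

print(f"{total_channels} channels, {aggregate_gbps} Gb/s aggregate")
```

Under that assumption the product comes to 960 Gb/s, which is in line with the rounded 1Tbps headline number.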
400G System Design Example
Packet Buffer Requirements and Assumptions
− 4 100GbE ports per Network Processor / Traffic Manager
− Packet buffering on ingress or egress
− Maintain 800Gbps effective bandwidth across all packet sizes at each packet buffer
* Courtesy Micron
400Gbps Packet Buffering: HMC Benefits
Area                        Benefit
Design simplification       One HMC device vs. 72 (48X16, 24X4 ECC) DDR4 DIMMs
Reduced active pins         4 HMC 15G SR links @ 12.5 Gbps vs. ~1900 DDR pins
Smaller memory footprint    Board area reduction
Abstract memory             Reduces system maintenance burden on host
Higher energy efficiency    Lower interface pJ/bit metric
Altera – Micron Technology Leadership with HMC
Altera – Micron Technology Leadership
− First interoperability announcement September 2013*
− Only FPGA to demonstrate interoperability at Super Computing 13*
Micron Booth
Altera Booth
Proven Technology with Stratix V Interoperability
15x more bandwidth, 70% less energy and 90% less board area
Hardware
− Altera: 5SGXMA3K1F35C1N FPGA
    Four FPGAs
    One "master" FPGA
− Micron: HMC Device
    Four separate x16 links
− Interop board
    I2C config of HMC
    Local power monitoring and control
    Altera Byte Blaster II USB port
Configuration Set-Up
− Exercise 4 HMC links
− Data rate of 10Gbps per lane
− Full-width link configuration (16 transceivers per link)
− 16B, 32B, 64B, 128B data packets (power-of-two payload scheme)
Controller Communication & Execution Scheme
Typically PC or laptop station
− Proprietary UDP packet format
− Communication to board using Ethernet (RGMII interface)
− Four Ethernet slots to communicate with 4 FPGAs
Configuration bus between Master and Slave FPGAs
− Enables operation with only one Ethernet connection between host and board; Master FPGA controls Slave FPGAs
Execute continuous loop of requests
− Stored in FPGA memory (as FLITs)
− Define a "calculation window" to monitor FLIT traffic, counting the number of FLITs and the number of requests of each data packet size
− Divide the number of data bits transmitted (determined by the request type) by the amount of time the "calculation window" was run for
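The calculation-window measurement described above can be sketched as follows. This is an illustration only, not the demo's actual code: the function name, the per-size request counters, and the sample numbers are all hypothetical.

```python
def window_bandwidth_gbps(request_counts, window_ns):
    """Data bandwidth over one calculation window, in Gb/s.

    request_counts maps data payload size in bytes to the number of
    requests of that size counted during the window (hypothetical layout).
    window_ns is the window length in nanoseconds.
    """
    if window_ns <= 0:
        raise ValueError("window length must be positive")
    # The request type determines how many data bits each request carries.
    data_bits = sum(size * 8 * count for size, count in request_counts.items())
    # Bits per nanosecond is numerically equal to Gb/s.
    return data_bits / window_ns


# Hypothetical counters captured over a 1 ms window.
counts = {128: 100_000, 64: 50_000}
gbps = window_bandwidth_gbps(counts, window_ns=1_000_000)
print(f"{gbps} Gb/s = {gbps / 8} GB/s")
```

Only payload bits are counted, so header and tail overhead lowers the reported figure relative to the raw link rate.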
Controller Architecture
Packet Generator
− Responsible for generating quad FLIT packets (512 bits wide)
− 1K FLIT buffer using internal FPGA memory to store patterns
− Capable of 100% link saturation
Transmitter
− Updates Header and Tail of packets from Pattern Generator
− Manages functions such as token-based flow control, CRC insertion, etc.
− Includes retry buffer to accommodate link bit errors
Receiver
− Validates received packets
− Checks for correct framing, sequence numbers, and CRC
− Extracts flow control and retry pointers and feeds them to Transmitter
Transceiver
− Full-width configuration (16 lanes per link)
− Sends/receives parallel 32-bit data over 16 lanes (512 bits wide)
− Implements link negotiation and training at startup
− Performs scramble/descramble of data
− Implements clock domain crossings between core clock and link clocks
Protocol Analyzer
− Raw TX and RX FLIT capture
− Allows detailed analysis of link traffic
Command Histogram
− Accumulates counts of all TX and RX commands
− Allows for analysis of link traffic patterns
Latency Histogram
− Allows latency measurement of 2 commands or groups of commands
− Commands and groups are programmable
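The token-based flow control handled by the Transmitter can be pictured with a toy model. This is a generic sketch of the technique, not the controller's RTL or the HMC protocol's exact token rules: the class name, the one-token-per-FLIT accounting, and the initial token count are assumptions made for illustration.

```python
from collections import deque


class TokenTransmitter:
    """Toy model of token (credit) based flow control on one link."""

    def __init__(self, initial_tokens):
        # Assumption: one token represents one FLIT of free receiver buffer.
        self.tokens = initial_tokens
        self.pending = deque()  # packets waiting for enough tokens
        self.sent = []          # packets placed on the link, in order

    def submit(self, name, flits):
        """Queue a packet of `flits` FLITs and send whatever now fits."""
        self.pending.append((name, flits))
        self._drain()

    def return_tokens(self, count):
        """Receiver reports `count` FLITs of freed buffer space."""
        self.tokens += count
        self._drain()

    def _drain(self):
        # Send in order; stall at the first packet that does not fit.
        while self.pending and self.pending[0][1] <= self.tokens:
            name, flits = self.pending.popleft()
            self.tokens -= flits
            self.sent.append(name)


tx = TokenTransmitter(initial_tokens=10)
tx.submit("write0", 9)        # 128B write: 9 FLITs
tx.submit("read0", 1)         # read request: 1 FLIT, uses the last token
tx.submit("write1", 9)        # stalls: no tokens left
stalled = list(tx.pending)
tx.return_tokens(9)           # receiver frees buffer space, write1 goes out
print(tx.sent, tx.tokens)
```

The transmitter never sends more than the receiver has advertised room for, so the link can run at full rate without buffer overflow as long as tokens are returned promptly.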
Theoretical Performance Number
Link speed: 10Gbps, Link width: 16 lanes
− Theoretical bandwidth = 160Gb/s (20GB/s) per direction
− For a 128-byte packet, 9 FLITs in total (1 FLIT = 128 bits)
− 1 FLIT for header and tail, 8 FLITs for payload
Traffic Pattern Scenarios
− 100% Write, 100% Read, 50% Write/Read with non-posted write
50% Write (non-posted) and 50% Read
− Theoretical number: 8 / 10 * 20 = 16GB/s
− Actual number in test: ~16GB/s
[Diagram: FPGA to HMC: 128B Write (9 FLITs), Read (1 FLIT). HMC to FPGA: Wr Re* (1 FLIT), Read Back Data (9 FLITs).]
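The 8 / 10 * 20 figure follows from the FLIT counts on this slide: in the 50/50 mix, each direction carries 10 FLITs per write/read pair (9 + 1), of which 8 are payload. A minimal sketch of that arithmetic, with illustrative names:

```python
FLIT_BYTES = 16  # 1 FLIT = 128 bits


def raw_link_gbytes(lane_gbps, lanes):
    """Raw link bandwidth per direction in GB/s."""
    return lane_gbps * lanes / 8


def packet_flits(payload_bytes):
    """Payload FLITs plus 1 FLIT for header and tail."""
    return payload_bytes // FLIT_BYTES + 1


raw = raw_link_gbytes(lane_gbps=10, lanes=16)   # 20 GB/s per direction

write_flits = packet_flits(128)                 # 128B write request: 9 FLITs
read_req_flits = packet_flits(0)                # read request: 1 FLIT
flits_per_pair = write_flits + read_req_flits   # 10 FLITs on the wire
payload_flits = 128 // FLIT_BYTES               # 8 FLITs carry data

effective = raw * payload_flits / flits_per_pair
print(f"{effective} GB/s per direction")
```

The same count applies in the return direction, where a 1-FLIT write response and a 9-FLIT read response share the link.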
HMC Demo Example
GUI-based demo developed
− Ability to showcase key parameters in real time across multiple user configuration settings
Maximum bandwidth example shown below
− 128B data payload, 50% access ratio (half TX and half RX), link rate 10Gbps
− Total aggregate bandwidth across 4 links (TX + RX) ~128GB/s (green line in aggregate window)
Configuration Controller Window
Aggregate Link Display Window
Link Display Window
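The ~128GB/s aggregate in the demo is consistent with roughly 16GB/s of data per direction on each of the 4 links. A short sketch of that sum, assuming every link runs the same 50/50 traffic mix (the constant names are illustrative):

```python
LINKS = 4
DIRECTIONS = 2                # TX and RX
GBYTES_PER_DIRECTION = 16     # per-link figure from the 50/50 test case

aggregate_gbytes = LINKS * DIRECTIONS * GBYTES_PER_DIRECTION
print(f"{aggregate_gbytes} GB/s across {LINKS} links (TX + RX)")
```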
Target Markets and Applications
Market                        Value
Wireline Communications       Pin-efficient bandwidth (400G); PCB density
High Performance Computing    PCB density; high bandwidth; reduced power
Military                      Radar - high BW; Signaling intelligence (Search) – parallel search/matching algorithms
Test & Measurement            High bandwidth for data capture of ultra-high speed analog waveforms
Enabling higher performance and lower power across many applications
Breakthrough Advantage with Generation 10 Devices

Delivering Unimaginable Performance
− Intel 14 nm Tri-Gate process
− 2x performance increase or 70% power savings
− 64-bit quad core ARM A53
− 3D-capable for integrating SRAM, DRAM, ASIC
− Up to 144 transceiver channels
− 32 Gbps chip-to-chip

Reinventing the Midrange
− TSMC 20 nm process
− 15% higher performance than current high-end with 40% lower midrange power
− Compatible 32-bit dual ARM A9 with 1.6x processor system improvement
− Up to 96 transceiver channels
− 28 Gbps chip-to-chip

Generation 10 FPGA and SoCs deliver Hybrid Memory Cube for volume production
Need more Information?
Altera HMC Landing Page (Video, White paper, roadmap)
− http://www.altera.com/hmc
Micron HMC Landing Page
− http://www.micron.com
Joint Press Release
− http://newsroom.altera.com/press-releases/nr-altera-micron-hmc.htm
Press Coverage
− EE Times: http://www.eetimes.com/document.asp?doc_id=1319391
− EBN: http://www.ebnonline.com/author.asp?section_id=3382&doc_id=268265
HMCC Consortium
− http://www.hybridmemorycube.org/
ACE/EDN/EE Times Awards 2014
Altera/Micron HMC Interop Team
− Winner: Design Team of the Year
Back Up