Michael J. Miller
VP, Technology Innovation & System Applications
Second Generation Bandwidth Engine® IC
Breaks 4.5 Billion Accesses/sec
Hot Chips 25 Stanford Memorial Auditorium August 25-27, 2013
The Vision: Fast, Intelligent Access
Architecture
Packet Processor
0
1
n-1
n
Serial Link
Serial Link
Serial Link
Serial Link
…
…
…
Bandwidth Engine – Second Generation
12 GOp/s
4.5 GA
384 Gb/s
Bit Safe™
Technology
…
…
ingress egress
Multi-bank
Multi-partitions
allow for
high access
availability
Multi-threaded
Multi-Cores
allow for
high processing
throughput
Multi-linked
allow for
concurrent
transport
operations
Fixed Macro
ALUs for
functional
Acceleration
minimizes
intra-chip
traffic
Atomic
Memory
Operations
Allows
Extended
Carrier Class
& In package
Repair
ALU ALU
2 Copyright ©MoSys, Inc. 2013. All rights reserved.
Bandwidth Engine 2 Architecture & Family
Sampling Now
Parallel Array Architecture … Performance up to: 16 outstanding transactions
4.5G Accesses per second
192 Gbps full duplex throughput
~12ns deterministic read latency
2.7ns Random cycle time (tRC)
GigaChip Interface … 90% Efficient Transport Protocol Up to sixteen low latency SerDes lanes (8G to 15G)
High Reliability …70X better SER than 6T-SRAM Full ECC support; 72bit array and macro datapath operations
CRC protected and self recovering GigaChip Interface
SEU resistant 1T-SRAM Memory core: < 10 FIT/Mb
Bit Safe™ Self Test and Self Repair Option
`
Array Manager
GCI Port B
GCI Port A
TX RX
8 8 8 8
2M X 72
Burst,
Flat,
Macro
2M X 72
Burst,
Flat,
Macro
2M X 72
Burst,
Flat,
Macro
2M X 72
Burst,
Flat,
Macro
GCI Port GCI Port
Host
TX RX
RX TX RX TX
Applications BE1 - MSR576 MSR620 MSR820 MSR720
Lookup, LPM, Hash Highest single component Access Rate
Statistics Onboard ALU Onboard ALU+
Buffer – up to 80% efficient Per cycle Burst, Write Broadcast
Metering, Dual Ops Fixed Macro
Semaphore, Link List Atomic Ops
State, Queuing, Link List Dual Port w/ Data Coherency,
36 bit word access
Bandwidth Engine – 2nd Generation
3 Copyright ©MoSys, Inc. 2013. All rights reserved.
Bandwidth Engine MSR720 Architecture
GCI-B
GCI-B
15G Rx SerDes
15G Tx SerDes
GCI-A
15G Rx SerDes
GCI-A
15G Tx SerDes
Memory Controller BCR BCR BCR BCR
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 59
Bank 60
Bank 61
Bank 62
Bank 63
Memory Ops:
Rd/Wr 72b
Wr 36b
375 MHz Clock x
8 Reads +
8 Writes
6GA Internally
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 59
Bank 60
Bank 61
Bank 62
Bank 63
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 59
Bank 60
Bank 61
Bank 62
Bank 63
Bank 0
Bank 1
Bank 2
Bank 3
Bank 4
Bank 59
Bank 60
Bank 61
Bank 62
Bank 63
Results
Bank Conflict Resolution:
32K x 72b SRAM
Delayed Write Cache
4 x Partitions:
64 Single Port Banks
32K x 72b each
5 Copyright ©MoSys, Inc. 2013. All rights reserved.
A r r a y M a n a g e r
GCI Port B
GCI Port A
RX TX RX TX
TX RX TX RX
8 8 8 8
2M X 72
(1 bank)
2M X 72
(1 bank)
2M X 72
(1 bank)
2M X 72
(1 bank)
GCI Port GCI Port
Host
MSR720: Bandwidth Engine 2 – Access
Simultaneous Read & Write @ same
address
… treat each partition as a single bank
… >>2x the access performance of QDR SRAM • 4.5GA : 3 billion reads/sec, 1.5B 36b write/sec
72b & 36b words each access
12 ns read latency pin to pin
2.7ns tRC cycle time
8.5W @ 12.5G system power
… 90% efficient transport protocol Up to Sixteen 15G serial lanes (2 links of 8 lanes)
High Density
… 576Mbit 1T-SRAM ® memory core 19mm x 19mm package – 1mm pitch
High Reliability
… 70X better SER than 6T-based SRAM Memory core: < 10 FIT/Mb native
Interface: < 1 FIT
Bit Safe™ Self Test and Self Repair Option
Bandwidth Engine - Access
6 Copyright ©MoSys, Inc. 2013. All rights reserved.
-
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
0% 20% 40% 60% 80% 100%
Gig
a-A
cce
ss
es
/se
c (
GA
/s)
%Reads
MSR720: Basic ‘Dual Port’ Operation
Flat Partition Mode:
Double the effective bandwidth in worst case: • Bank Conflict Resolution (BCR) function allows for
simultaneous Read and Write of the same bank
• Implements dual port behavior with single port banks
Guarantees data coherency
3 billion accesses / sec @ 15G
Memory
Partition
BCR logic RDFP Data
WRFP Data
banks hidden =
‘flat’ addressing
BCR guarantees
full data
coherency
Dynamically
Scheduled
Writes
Performance @ 15G
Scheduling Balanced R:W
Balanced R:W Controller
Fra
me
Input (RX)
Fra
me
Output (TX)
Partition CMDARX CMDBRX QATX QATX
1 0/1
RDFP WRFP RDFP WRFP 16 RDFP RDFP
2 WD WD 17
3 2/3
RDFP WRFP RDFP WRFP 18 RDFP RDFP
4 WD WD 19
15G
1.50 GA Write
1.50 GA Read
3.00 GA Total
Re
ad
Wri
te
Wri
te
Rea
d
7 Copyright ©MoSys, Inc. 2013. All rights reserved.
“BCR” Implemented w/Delayed Write Cache
Conceptual Block Diagram
Bank 0 Bank 1 Bank 2
Bank n-1 Bank n
Delayed Write Cache Dual Port Memory
Write
Data
Read
Data
Data Data
Write Address
Bank Offset
Read Address
Bank Offset Addr Addr
Write Address
Bank #
Read Address
Bank # Addr Tag
Cached Read Data
Read Data
Hit
RAB # WAB # WAB Offset RAB Offset
Addr Tag
Cached Write Data
8 Copyright ©MoSys, Inc. 2013. All rights reserved.
Up to 4 Accesses Per Partition In One Cycle
3 Accesses Sustained Throughput
The MSR720 supports Bank Aware commands. This can be used to improve the data read performance on the interface,
however bypasses the BCR logic
Useful for unified memory applications combining dual port SRAM performance of “buffers” or “state” tables and read-only “lookup” tables
The two address ranges cannot overlap.
MSR720 supports 36b write operations: WRFP and WRBA Useful for small word size tables (pointers)
Increases write performance
Memory
Partition
BCR logic
WRBA Data
RDFP Data
RDBA Data
WRFP Data
banked array
banks hidden
‘flat’ addressing
Dynamically
Scheduled
Writes
9 Copyright ©MoSys, Inc. 2013. All rights reserved.
Write Flat Partition
Read Flat Partition
Read Bank Aware
Write Bank Aware
1
2
3 4
Mapping Ports and Banks
A r r a y M a n a g e r
GCI Port B
GCI Port A
RX TX RX TX
8 8 8 8
1M X 72
(1 bank)
1M X 72
(1 bank)
1M X 72
(1 bank)
1M X 72
(1 bank)
100G
100G
100G
100G
PPE
PPE
PPE
PPE
1M X 72
1M X 72
1M X 72
1M X 72
For State Memory
(WrFP, RdFP)
For other
functions like
Look Up Table
(RdBA) 10 Copyright ©MoSys, Inc. 2013. All rights reserved.
MSR720 Access Scheduling
Flat Partition guarantees no bank conflict between read and write to any address, even in the same cycle
Balanced R:W Controller (Basic 72b R:W Operation) Native BE Controller (72b WRITES)
Fram
e Input (RX)
Fram
e Output (TX)
Fram
e Input (RX)
Fram
e Output (TX) Partition CMDARX CMDBRX QATX QATX Partition CMDARX CMDBRX QATX QATX
1 0/1
RDFP WRFP RDFP WRFP 16 RDFP Data RDFP Data 1 0/1
RDFP RDBA RDFP RDBA 16 RDFP RDFP 2 WD WD 17 2 WRFP WDL WRFP WDL 17 RDBA RDBA 3
2/3 RDFP WRFP RDFP WRFP 18 RDFP Data RDFP Data 3
2/3 RDFP RDBA RDFP RDBA 18 RDFP RDFP
4 WD WD 19 4 WRFP WDL WRFP WDL 19 RDBA RDBA 15G 1
0/1 RDFP RDBA RDFP RDBA 16 RDFP RDFP
1.50 GA Write 2 WRFP WDU WRFP WDU 17 RDBA RDBA 1.50 GA Read 3
2/3 RDFP RDBA RDFP RDBA 18 RDFP RDFP
3.00 GA Total 4 WRFP WDU WRFP WDU 19 RDBA RDBA 15G
1.50 GA Read Native BE Controller (36b WRITES) 0.75 GA Write WDL = Lower 36b data
Fram
e Input (RX)
Fram
e Output (TX) 1.50 GA Read WDU = Upper 36b data
Partition CMDARX CMDBRX QATX QATX 3.75 GA Total 1
0/1 RDFP RDBA RDFP RDBA 16 RDFP RDFP
2 WRFP WD WRFP WD 17 RDBA RDBA Partition Access Restrictions 3
2/3 RDFP RDBA RDFP RDBA 18 RDFP RDFP
Fram
e RX 4 WRFP WD WRFP WD 19 RDBA RDBA Partition GCI-A GCI-B
15G 1 0/1 P0 P1
1.50 GA Read 2 1.50 GA Write (36b) 3
2/3 P2 P3
1.50 GA Read 4
4.50 GA Total
BA
FP FP
BA
FP FP
FP FP
11 Copyright ©MoSys, Inc. 2013. All rights reserved.
8 x SerDes
Rx Lanes 8 x SerDes
Rx Lanes
8 x SerDes
Tx Lanes 8 x SerDes
Tx Lanes
2.7
ns
WD = 72b data
MSR 720 : Breaking 4.5GA
-
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
0% 20% 40% 60% 80% 100%
Gig
a-A
cc
es
se
s/s
ec
(G
A/s
)
Read to Write Ratio (%Reads)
Second Gen BE:
MSR 720
Peak
w/15Gbps
SerDes
72b/36b FP Write
72b FP Read
72b BA Read
Sigma-Quad
or QDR SRAM
BL2 @ 600MHz
36b BA Write
First Gen BE
Note: Datasheet comparison - 1 access = 72 bits
- FP Access is arbitrated by BCR.
- BA Access is to/from array only. Must be bank aware.
RLDRAM3
12 Copyright ©MoSys, Inc. 2013. All rights reserved.
Bandwidth Engine 2 Architecture & Family
Parallel Array Architecture … Performance up to: 16 outstanding transactions
4.5G Accesses per second
192 Gbps full duplex throughput
~12ns deterministic read latency
2.7ns Random cycle time (tRC)
GigaChip Interface … 90% Efficient Transport Protocol Up to sixteen low latency SerDes lanes (8G to 15G)
High Reliability …70X better SER than 6T-SRAM Full ECC support; 72bit array and macro datapath operations
CRC protected and self recovering GigaChip Interface
SEU resistant 1T-SRAM Memory core: < 10 FIT/Mb
Bit Safe™ Self Test and Self Repair Option
`
Array Manager
GCI Port B
GCI Port A
TX RX
8 8 8 8
2M X 72
Burst,
Flat,
Macro
2M X 72
Burst,
Flat,
Macro
2M X 72
Burst,
Flat,
Macro
2M X 72
Burst,
Flat,
Macro
GCI Port GCI Port
Host
TX RX
RX TX RX TX
Applications BE1 - MSR576 MSR620 MSR820 MSR720
Lookup, LPM, Hash Highest single component Access Rate
Statistics Onboard ALU Onboard ALU+
Buffer – up to 80% efficient Per cycle Burst, Write Broadcast
Metering, Dual Ops Fixed Macro
Semaphore, Link List Atomic Ops
State, Queuing, Link List Dual Port w/ Data Coherency,
36 bit word access
Bandwidth Engine – 2nd Generation
14 Copyright ©MoSys, Inc. 2013. All rights reserved.
2M x 72 Partition
ECC
ECC
RA
RD Byte Count
WA WD WD
RD Packet Count
Addr 16b Imm
16
16
16b
64b
72
GCI Port
RX TX
16ns
ECC
ECC
1b 64b
+1
72
Dual Counter: 5 Stage Pipeline
Includes Index Compare Logic (not illustrated) for Data Forwarding:
In case of an index match, data is forwarded in the pipeline
Prevents stale data in the pipeline
Similar pipeline for Split Counter
End-to-End ECC Protection
16 ns to completion
15 Copyright ©MoSys, Inc. 2013. All rights reserved.
Individual Flow Programmability Meter Type
Flow Rates
Thresholds
8M Two Color Flows
4M Three Color Flows
Line Rate 4x100G
MSR820 – Metering Capabilities
CIR
EIR
CBS
EBS
> EBS
+
≤ EBS ≤ CBS
+
CIR
CBS
EBS
> EBS
+
≤ EBS ≤ CBS
+
Overflow
Single Rate Three Color Meter Two Rate Three Color Meter
Basic Meter (Two Color)
IR
BS
> BS ≤ BS
+
16 Copyright ©MoSys, Inc. 2013. All rights reserved.
Intelligent Offload, Fire Forward Architecture
MSR820 – Bandwidth Engine 2 –
Intelligent Memory Macros Leverages I/O
Second Generation Bandwidth Engine Architecture Up to 4.5 billion external memory accesses per second w/16 SerDes Lanes
Macros support up to 6 billion internal accesses per second w/8 SerDes Lane
Macros execute Atomically: Stats, Metering, Read & Set, Test & Set
Partition
Bandwidth Engine
f(χ)
Partition f(χ)
Partition f(χ)
Partition f(χ)
Packet Processor
Instruction
CIR
EIR
CBS
EBS
> EBS
+ ≤ EBS ≤ CBS
+
Stats Table +1
Byte Packet
Supports:
4 x 100GE ports
w/ Stats + Metering
over 8 lanes
17 Copyright ©MoSys, Inc. 2013. All rights reserved.
Conceptual Timing & Data Access Control
Analog D Q
Clock Data Recovery
Slip B
uf
DeS
cram
ble
r
Memory Macro
Serializer
Analog
Ou
tpu
t Reg
Frame Detect & Clock Div
Bit Clk
Lane Offset
Access Phase DLL Timing
Data Path
Timing Path
Rx & Bit Aligner
Lane/FrameAligner Serializer&Tx
Mem Access & RMW
BE 1: Read Latency of 15.9ns vs. BE2: Read Latency of ~12.5ns
19 Copyright ©MoSys, Inc. 2013. All rights reserved.
DeS
eria
lizer
Ou
tpu
t Reg
Instru
ction
P
ipelin
e
Core Clk
ALU
BCR
MSR720 Layout
BCR Cach
e
Ban
k 0
8x SerDes
8x SerDes
32K x 72b 32K x (72b + 8b + 6b +1b) Data + ECC + Tag + Valid
Partition 0
~6x
Area
SerDes I/O
5% of Die Area
@ 240 Gbps
20 Copyright ©MoSys, Inc. 2013. All rights reserved.
QDR like
Dual Port
Performance
for 1.10x
vs
3x die area
cost
Michael J. Miller
Thank you
Copyright ©MoSys, Inc. 2013. All rights reserved. 21
Hot Chips 25 Stanford Memorial Auditorium August 25-27, 2013
Top Related