BeiHang Short Course, Part 6: CoRAM FPGA...
Transcript of BeiHang Short Course, Part 6: CoRAM FPGA...
![Page 1: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/1.jpg)
9/24/2014
1
BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”
James C. Hoe
Department of ECE
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s1
Department of ECE
Carnegie Mellon University
Collaborators: Eric S. Chung (MSR), Gabriel Weisz, Michael Papamichael, and Ken Mai
Eric S. Chung, James C. Hoe, and Kenneth Mai, “CoRAM: An In‐Fabric Memory Architecture for FPGA‐based Computing,” Proc. ACM International Symposium on Field‐Programmable Gate Arrays (FPGA), pp 97‐106, February 2011.
Moore’s Law for FPGA
10TFLOPSStratix 10
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s2
2000 2005 2010 2015
![Page 2: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/2.jpg)
9/24/2014
2
Why doesn’t everybody want to compute with FPGAs?
“ di i ll G h b h b d
G d O d i d f i
“Traditionally, FPGAs have been the bastardstep‐brother of ASICs…”
Proceedings of ISFPGA 2004
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s3
• FPGAs today NOT designed for computing
– tools and languages difficult to use
– users exposed to device and platform details
– no application portability
Outline
• Motivations
• CoRAM Architecture
• Current work and evaluations
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s4
• Conclusions
![Page 3: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/3.jpg)
9/24/2014
3
Review of FPGA Anatomy
Block RAM
I
Programmable lookup tables (LUT) and flip‐flops (FF)
aka “soft logic” or “fabric”
4KBBlockRAM
Write Data
Read DataLocal
Address
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s5
I/O pins Interconnect
LUT FF
Computing on FPGA TodayI/O Pads
kernel
SRAM
User responsible for:
application and data
kernel
SRAMFPGA
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s6
I/O Pads
kernel
SRAM
![Page 4: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/4.jpg)
9/24/2014
4
Computing on FPGA TodayI/O Pads
Memory Controller Memory Controller
User responsible for:
application and data
platform I/O and
memory interfaces
data distribution
kernel
SRAM
ControlLogic
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s7
memory control logic
I/O Pads
Memory Controller Memory Controller
The CoRAM FPGAI/O Pads
Memory Controller Memory ControllerMemory Interfaces (Global Address Space)
SRA
M
kernel
SRAM
ControlLogic
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s8
I/O Pads
Memory Controller Memory ControllerMemory Interfaces (Global Address Space)
![Page 5: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/5.jpg)
9/24/2014
5
The CoRAM FPGA
Memory Interfaces (Global Address Space)
SRA
Mkernel
Control
Hard logic for data
distribution (NoC)
“Software”‐managed
memory hierarchy
General‐purpose Network‐on‐Chip
ControlLogic
kernel
SRAM
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s9
Memory Interfaces (Global Address Space)
Controlthreads
General‐purpose Network‐on‐Chip
CoRAM FPGA Architectural View
SRA
M
kernel
Control
Hard logic for data
distribution (NoC)
“Software”‐managed
memory hierarchy
Simple, portable
kernel
SRAM
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s10
Controlthreads
Simple, portable
abstraction for user
Memory
![Page 6: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/6.jpg)
9/24/2014
6
FPGA as First‐Class, Equal Partner
CPU FPGACPUCPU FPGA
DRAM DRAM
CPU
Memory Bus / Interconnect
Core Core FPGA
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s11
L1
Single‐Chip Heterogeneous Processor
L1
L2 Cache
Fabric
DRAM
Simple Examplectrlthread() {
cohandle c0 = get_sram(“c0”);
1 c coram rite(c0
SRAMAddress
}
module verilog_top(…);
coram c0(/*ports*/);
cofifo f0 (/*ports*);
…endmodule
Control Thread
1 c_coram_write(c0,sram_address,global_address,size);
2 c_fifo_write(1);
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s12
C0WrData
RdData
kernel Thread
1GlobalMemory
DataData
2
Channel FIFO
![Page 7: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/7.jpg)
9/24/2014
7
Control ActionsEdge Memory
SingleCoRAM
Edge Memory2 CoRAMs
(concatenated)
memaddroffset
bytes
coram_write( coh coram, void *offset,void *memaddr, int bytes);
CoRAM
memaddroffset
bytes
collective_write(coh coram, void *offset,void *memaddr, int bytes);
Edge MemorySrc CoRAM
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s13
2 CoRAMs(interleaved)
memaddr offset
collective_write(coh coram, void *offset,void *memaddr, int bytes);
A BC D
A B C D
srcoffsetdstoffset
bytescoram_copy(coh src, coh dst,
void *srcoffset,void *dstoffset,int bytes);
Dst CoRAM
Design Case Study: MMMult
MMM in hardwareCompute Engine
Single Compute Engine
N compute engines(1 per row of matrix )
custom data network
Compute Engine
Compute Engine
Compute Engine
Compute Engine
Compute Engine
p g
C=ABA
Network
B SRAM
OP=writebackOP=load
R
R
x +
WenRenPktIn
PktOut
A/B SRAMsC SRAM
Control
toNext
PacketProcess
x +
A/B SRAMsC SRAM
Control
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s14
Compute Engine
B C
OP PE_ID DATA
DRAM
Packet Generator Address Generator
Packet Format
C SRAM
B
A SRAM
toNext
![Page 8: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/8.jpg)
9/24/2014
8
Control Thread Program
MMMult with CoRAM
Channel FIFOPEs ready
void ctrl_thread() {for (j = 0; j < N; j += NB)for (i = 0; i < N; i += NB)for (k = 0; k < N; k += NB) { c_fifo_read(…); for (m = 0; m < NB; m++) { c_collective_write(
ramsA, m*NB, A + i*N+k + m*N
Start compute
x +
A/B SRAMsC SRAM
Control
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s15
A + i*N+k + m*N, NB*dsz);
…}
c_fifo_write(…);}
}}
}
Related Work in Abstracting
• BORPH [So, UCB PhD Thesis 2007]
– support HW kernels with fixed interface/infrastructurepp
– manage HW kernels as processes
– communication between HW and SW processes
• VirtualRC [Kirchgessner, et al., ISFPGA 2012]
– “virtual machine” for FPGA platforms
– fixed SW API to interact with HW kernels
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s16
fixed SW API to interact with HW kernels
• LEAP [Adler, et al., ISFPGA 2011]
– large, linear memory abstraction of local backing store
– communicating with SW and remote memory
![Page 9: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/9.jpg)
9/24/2014
9
Outline
• Motivations
• CoRAM FPGA Architecture
• Development and evaluations
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s17
• Conclusions
Beneath the Abstraction
Control
M
kernel
Architecture
Microarchitecture
t t
thread programs
Control Unit Control Unit
SRAM
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s18
Memory Interfaces (DRAM or Cache)
Fabric
Bulk Data Distribution (Network‐on‐Chip)
Cluster Unit
Fabric
Cluster Unit
CoRAMFPGA
![Page 10: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/10.jpg)
9/24/2014
10
Decoder
CoramIdx == 0?
CoramIdx == 1?
CoramEn0 (1‐bit)
Data Steer
4‐inputOR
Mux
LaneSelect0 (2‐bit)
CoRAM Cluster DesignSRAMFrontend Pipeline DataDeq
64x5‐bitmaptable
i10‐bit/6‐bit
Divider
LocalAddress (10‐bit)Map Table Index (5‐bit)
CoramIdx == 31?
CoramIdx == 0?
CoramIdx == 1?
CoramIdx == 31?
CoramIdx == 0?
CoramIdx == 1?
CoramIdx == 31?
CoramIdx == 0?
CoramIdx == 1?
CoramIdx == 31?
Ctrl
LaneSelect (2‐bit)
4‐inputOR
Mux Ctrl
4‐inputOR
Mux Ctrl
4‐inputOR
Mux Ctrl
32
addrQ
C Id (5 bit)
64x5‐bitmaptable
64x5‐bitmaptable
64x5‐bitmaptable
i+1
i+2
i+3
10‐bit/6‐bit
Divider
10‐bit/6‐bit
Divider
10‐bit/6‐bit
Dividerw
idxQ Hazard Detection
Deq0
Deq1
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s19
DataIn (32‐bit)
CORAM_WORD_WIDTH (32‐bit)
Data arrives sequentially from main memory, all accesses assumed 4‐byte word‐aligned
datai
datai+1
datai+2
datai+3
dataQCoramIdx (5‐bit)
CoramAddr (10‐bit)
Data response packet arrival in single clock
Cluster Resources
5000
FPGA logic area for 32‐way Cluster, 16B memory datapath, W=4
21% of V6 LX760 (if every BRAM used as CoRAM)
1500
2000
2500
3000
3500
4000
4500
t LU
Ts (40nm Virtex‐6)
Output 32x4 Crossbar (data)Input 4x32 Crossbar (data)Input 4x32 Crossbar (address)MemRequestQ (data‐out)DataOutQsDividers (7‐bit x 6‐bit)Input Crossbar DecoderDataInQsIntermediate Data QsMessage CountersLane SchedulerIntermediate State TablesOtherFreelist FIFO (physical tags)PtagQ (data‐out)Freelist FIFO (transaction tags)
21% of V6‐LX760 (if every BRAM used as CoRAM)
17%
10% 9%
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s20
0
500
1000
1500
6‐input Freelist FIFO (transaction tags)
Input 4x32 Crossbar (1‐bit rnw)Pattern TableDivision Status TableFree ptag counterIntermediate Token QPattern Offset TableLane Mask LogicGrantQControlActionQMemRequestQ (data‐in)
<1%FPGA max clock freq ~ 250‐300MHz
65nm ASIC ~1.1GHz
![Page 11: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/11.jpg)
9/24/2014
11
RTL Prototyping on FPGA
Microblaze(100MHz)
I/O DevicesI/O Subsystem
100MHz
Router
Router
CacheBank
CacheBank
Memory
128b
128bRouter
AXI Memory Bus
100MHz100MHz
Adapter
200MHz
Router
x+
Control Unit
x+
Control Unit
x+
Control UnitControl
and Drivers
MMMultiply Compute Elements
CoRAMCluster
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s21
MemoryController(Xilinx MIG) 128b
Arbiter
RouterRouter
RouterCacheBank
128bCacheBank
Xilinx Platform
512b
x+
Control Unit
x+
Control Unit
x+
Control UnitControl
CoRAMCluster
NoC and ClustersDRAM Interfaces DRAM Caches
CoRAMCluster
Router
Outline
• Motivations
• CoRAM FPGA Architecture
• Development and evaluations
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s22
• Conclusions
![Page 12: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/12.jpg)
9/24/2014
12
M
kernel
BTW, we got it wrong……
M
FFT
M
FFT
CoRAM
kernel
Controlthreads
Memory
CoRAM
FFT(streaming)
CoRAM
FFT(streaming)
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s23
• Architecture design is an exercise in compromise
i.e., enough but not too much
• Do we really need an FPGA “architecture” just for computing?
CoRAM “Architecture” Revisited
Memory Interfaces (Global Address Space) Must have hardened
CoRA
M
kernel
Control
memory interface and NoC
I like the decoupled computation paradigm
Do I need to harden
General‐purpose Network‐on‐Chip
kernel
CoRAM
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s24
Memory Interfaces (Global Address Space)
Controlthreads
Do I need to harden CoRAM?
– fixed set of commands
– insolate logic to one‐side
General‐purpose Network‐on‐Chip
![Page 13: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/13.jpg)
9/24/2014
13
CoRAM Reconstructed as “APIs”
nds API
‐like API
infrastructure for lib builders
abstraction for app builders
CoRAM
CoRAM
controlh d k l
comman
mem
ory
pattern specific logic
pattern specificlogic B
RAM
NoC
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s25
threads HW kernels
end developer
end developer
pattern specificlogic
Memory Interface
Recap the Last Hour
• Future must look beyond general purpose
– FPGAs offer impressive performance and efficiencyp p y
– but need to be easy‐to‐use and portable
• The CoRAM Architecture/API
– standard, scalable memory abstraction for FPGA
– simplifies application development and portability
• Software is easy; hardware is hard?
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s26
• Software is easy; hardware is hard?
– C libraries and system calls is a big part of this impression
– CPUs and GPUs are not much easier if to get the last ounce of performance
![Page 14: BeiHang Short Course, Part 6: CoRAM FPGA “Architecture”users.ece.cmu.edu/~jhoe/distribution/bhsc14/BH6-CoRAM.pdf · 9/24/2014 2 Why doesn’t everybody want to compute with FPGAs?](https://reader033.fdocuments.us/reader033/viewer/2022050717/5e14de4550ec9444177a67a6/html5/thumbnails/14.jpg)
9/24/2014
14
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L6‐s27
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/CALCM