Post on 24-Apr-2020
6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE SPRING 2013
LECTURE 16 FINE-GRAINED RECONFIGURABLE COMPUTING: FPGAS
JOEL EMER AND DANIEL SANCHEZ
Field Programmable Gate Arrays (FPGA)
[Figure: FPGA logic element built from a RAM-based LUT and a latch. Example LUT contents (inputs, output):]
And      Or
00 0     00 0
01 0     01 1
10 0     10 1
11 1     11 1
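The truth tables above can be read as the contents of a small RAM that the inputs address. A minimal sketch of that idea follows; the class name `Lut2` and the bit ordering (first input as the high address bit) are assumptions made for illustration, not something the slide specifies.

```python
class Lut2:
    """A 2-input lookup table: four configuration bits addressed by the inputs."""

    def __init__(self, config_bits):
        if len(config_bits) != 4 or any(b not in (0, 1) for b in config_bits):
            raise ValueError("a 2-input LUT needs exactly four 0/1 entries")
        self.ram = list(config_bits)

    def evaluate(self, a, b):
        # The inputs form the RAM address: a is the high bit, b is the low bit.
        return self.ram[(a << 1) | b]


# Entries are listed for inputs 00, 01, 10, 11.
AND_LUT = Lut2([0, 0, 0, 1])
OR_LUT = Lut2([0, 1, 1, 1])

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, AND_LUT.evaluate(a, b), OR_LUT.evaluate(a, b))
```

Reprogramming the same hardware to compute a different function only means loading a different four-bit configuration.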
Evolution of FPGA applications
¨ Logic Replacement
  ¤ Low design cost and effort
  ¤ Low volume applications
  ¤ Often replaced with ASIC as volume increases
¨ Algorithmic computation
  ¤ Offloads a general purpose processor
  ¤ Used for multiple algorithms
  ¤ ASIC replacement not expected
6.888 Spring 2013 - Sanchez and Emer – L16
Benefits of FPGA computation
¨ Custom operations/data types - operations and data widths tailored to the application
¨ Flexible flow control - control flow based on arbitrary state machines
¨ Local state access - local state elements allow parallel state access
¨ Fine grain parallelism - replicated logic permits easy parallelism
¨ Custom communication - explicit direct inter-module communication
¨ Reduced memory references - more direct reuse of data
¨ Better power efficiency - more activity directly applied to computation
¨ Better area efficiency - more area directly applied to computation
QPI-attached FPGA platform
Intel QuickAssist QPI-based FPGA Accelerator Platform (QAP)
[Figure: Four Socket QuickAssist Platform Topology. Labels: P and M blocks, CS blocks with I/O, QPI Links, Intel® Xeon processor 7000 series, Accelerator Hardware Module (AHM) with QuickAssist FPGA, Altera Stratix IV Module, Xilinx Virtex 6 Module, PCIe attached FPGA.]
Reed Solomon Results
                              Xilinx IP   Catapult-C   Bluespec
Equivalent Gate Count         297,409     596,730      267,741
Frequency (MHz)               145.3       91.2         108.5
Steady State (Cycles/Block)   660         2073         276
Data rate (Mbps)              392.8       89.7         701.3
Gate count and cycles/block: lower is better. Frequency and data rate: higher is better.
The WiMAX requirement is a throughput of 134 Mbps.
Source: MIT, Abhinav Agarwal, Alfred Ng – CSG
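The data rate row follows from the frequency and the steady-state cycle count once a block size is fixed. The sketch below assumes 223 payload bytes (1784 bits) per block. The slide does not state that figure; it is inferred here because it reproduces the Xilinx IP and Bluespec rows. The Catapult-C rate on the slide (89.7) does not follow from its listed frequency and cycle count under this assumption, so treat the model as approximate.

```python
# Assumption: 223 payload bytes per Reed-Solomon block (not stated on the slide).
PAYLOAD_BITS_PER_BLOCK = 223 * 8

WIMAX_REQUIREMENT_MBPS = 134.0  # from the slide


def data_rate_mbps(freq_mhz, cycles_per_block, bits_per_block=PAYLOAD_BITS_PER_BLOCK):
    """Steady-state throughput: blocks per microsecond times payload bits per block."""
    blocks_per_microsecond = freq_mhz / cycles_per_block
    return blocks_per_microsecond * bits_per_block


# (frequency in MHz, steady-state cycles per block), copied from the table.
DESIGNS = {
    "Xilinx IP": (145.3, 660),
    "Catapult-C": (91.2, 2073),
    "Bluespec": (108.5, 276),
}

if __name__ == "__main__":
    for name, (freq, cycles) in DESIGNS.items():
        rate = data_rate_mbps(freq, cycles)
        meets = rate >= WIMAX_REQUIREMENT_MBPS
        print(f"{name}: {rate:.1f} Mbps, meets WiMAX requirement: {meets}")
```

The formula makes the trade-off visible: the Bluespec design runs at a lower clock than the Xilinx IP but needs far fewer cycles per block, so its throughput is higher.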
BORPH
¨ Berkeley Operating system for ReProgrammable Hardware
¨ OS for reconfigurable computers
  ¤ Treats reconfigurable hardware as computational resources
¨ UNIX interface to HW designs
  ¤ Familiar to both software and hardware engineers
  ¤ Design language independent
¨ Goal: make FPGA-based reconfigurable computers easy to use
Conventional View of FPGA Systems
[Figure: software/hardware stack. Software: three User Processes (SW) and a User Library above the OS Kernel, with file, IPC, socket, and pipe interfaces and a Device Driver. Hardware: Hardware Platform (Network, UART, HD…) and the FPGA “coprocessor”.]
Master-Slave Relationship
BORPH Layers
[Figure: software/hardware stack. Software: three User Processes (SW) with a User Library. Hardware: two User Processes (HW) with a Hardware User Library. Both sit above the BORPH Kernel, which offers file, IPC, socket, pipe, and ioreg virtual file interfaces, with a Device Driver and the Hardware Platform (Network, UART, HD…) below.]
Peer-to-Peer Relationship
Overview of BORPH Concepts
¨ Hardware process
¨ Hardware syscall interface
¨ Interacting with an FPGA
  ¤ ioreg virtual file interface
  ¤ Hardware file I/O
Hardware Process (1)
¨ An executing instance of a hardware design
  ¤ SW: An executing instance of a program
¨ Normal UNIX process
  ¤ Has a pid; check status with ps, kill, etc.
¨ Unit of management
¨ Created when a BORPH Object File (BOF) is exec-ed
  ¤ Kernel selects and configures a hardware region automatically
Benefits of UNIX Process Model
¨ Very easy for the user to reason about
¨ Enables FPGA designs to become active components of the system
  ¤ e.g. an FIR filter:
  ¤ Conventional: a passive entity to which software sends data and from which it receives data
  ¤ BORPH: an active entity that pulls/pushes data as needed
¨ Enables multiple instances of the same FPGA design to run in the system
  ¤ No more fixed accelerator concept
  ¤ Works well in true reconfigurable computing systems
HW Processes I/O
¨ I/O managed by kernel
  ¤ Similar to SW
¨ Hide details from users
  ¤ e.g. HW-SW and HW-HW UNIX file pipes
¨ Standard UNIX I/O mechanism
  ¤ File I/O, pipe, signal
¨ HW specific service
  ¤ ioreg virtual file system
Don’t ask “How do I … in HW”. Think: “What if it were SW?”
ioreg Virtual File System
¨ Maps user defined hardware constructs as virtual files under the process’s /proc/<pid>/hw/ioreg/ directory
  ¤ Single word register
  ¤ Memory: On-chip + Off-chip
  ¤ FIFO
¨ Example:
  ¤ /proc/123/hw/ioreg/COUNTERVAL
¨ ioreg information embedded in the executing BOF file
¨ read and write system calls translated to message packets by the kernel
  ¤ Any UNIX program can communicate with hardware processes
    ▪ Shell: echo 1 > /proc/123/hw/ioreg/enable
    ▪ C: MEM_FILE = fopen("/proc/123/hw/ioreg/MyMemory", "r");
         fread(swbuf, 1, MEM_SIZE, MEM_FILE);
    ▪ Python, Java, etc…
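Since ioreg entries are ordinary files, the Shell and C accesses above translate directly into other languages. The following is a minimal Python sketch. The helper names (`ioreg_path`, `write_register`, `read_memory`) and the `proc_root` parameter are hypothetical, introduced so the code can be tried against a scratch directory on a machine without BORPH; the format a real ioreg file accepts is whatever the BORPH kernel defines.

```python
import os


def ioreg_path(pid, name, proc_root="/proc"):
    """Build the virtual file path /proc/<pid>/hw/ioreg/<name>."""
    return os.path.join(proc_root, str(pid), "hw", "ioreg", name)


def write_register(pid, name, value, proc_root="/proc"):
    # Mirrors the shell example: echo <value> > /proc/<pid>/hw/ioreg/<name>
    with open(ioreg_path(pid, name, proc_root), "w") as f:
        f.write(f"{value}\n")


def read_memory(pid, name, size, proc_root="/proc"):
    # Mirrors the C example: fopen(..., "r") followed by fread of `size` bytes.
    with open(ioreg_path(pid, name, proc_root), "rb") as f:
        return f.read(size)
```

On a BORPH system, `write_register(123, "enable", 1)` would correspond to the shell line above, and `read_memory(123, "MyMemory", MEM_SIZE)` to the C fragment, with `MEM_SIZE` chosen by the caller.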
Hardware File I/O
¨ Access to the general file system from hardware processes
¨ Debug by printing
  ¤ printf
¨ Read test vectors, record output
¨ SW/HW processes chained by file pipe
[Figure: receiver chain with blocks Analog Frontend, A/D, Baseband Process, Upper Layer.]
[Figure: video.in → Decode → Resize → Edge Detect → Encode → video.out]
bash$ decode video.in | resize | edgdet.bof | encode > video.out
bash$ receiver.bof < file.in > file.out
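Because a hardware process looks like any other executable, the same chaining can be driven from a program instead of a shell. The sketch below is one possible way to do so with Python's standard `subprocess` module. The function name `run_pipeline` is hypothetical, and the stage names in the usage note come from the slide; they would only run on a system where those programs and BOF files exist.

```python
import subprocess


def run_pipeline(stages, in_path, out_path):
    """Run `stage0 < in_path | stage1 | ... > out_path`; return the exit codes."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        procs = []
        upstream = src
        for i, argv in enumerate(stages):
            is_last = i == len(stages) - 1
            proc = subprocess.Popen(
                argv,
                stdin=upstream,
                stdout=dst if is_last else subprocess.PIPE,
            )
            if procs:
                # The child holds its own copy of the pipe; drop ours so the
                # upstream stage is notified if this stage exits early.
                upstream.close()
            upstream = proc.stdout
            procs.append(proc)
        return [proc.wait() for proc in procs]
```

Under those assumptions, the video example would be written as `run_pipeline([["decode"], ["resize"], ["edgdet.bof"], ["encode"]], "video.in", "video.out")`, with `decode` reading its input from standard input instead of a file argument.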
Latency-Insensitive Design: A Higher Semantic
¨ Inter-module communication by latency insensitive channels
  ¤ Changing the timing behavior of a module does not affect functional correctness of the program
¨ Many HW designs use this methodology
  ¤ Improved modularity
  ¤ Simplified design-space exploration
¨ Implemented with guarded FIFOs in current RTLs
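The latency-insensitive property can be illustrated with a small cycle-by-cycle simulation. This is a sketch under assumed names (`LIChannel`, `simulate`) and an assumed toy workload (a consumer that doubles each value); it is not taken from the lecture. The channel is a guarded FIFO: sends are allowed only when it is not full, and receives only once data has arrived.

```python
from collections import deque


class LIChannel:
    """Guarded FIFO with a configurable latency and capacity; order is preserved."""

    def __init__(self, latency, capacity):
        self.latency = latency
        self.capacity = capacity
        self.items = deque()  # entries are (cycle when visible, value)

    def can_send(self):
        return len(self.items) < self.capacity

    def send(self, value, now):
        assert self.can_send()  # guard: never enqueue into a full channel
        self.items.append((now + self.latency, value))

    def can_recv(self, now):
        return bool(self.items) and self.items[0][0] <= now

    def recv(self, now):
        assert self.can_recv(now)  # guard: never dequeue before data arrives
        return self.items.popleft()[1]


def simulate(inputs, latency, capacity):
    """Producer sends inputs; consumer doubles them. Returns (outputs, cycles)."""
    chan = LIChannel(latency, capacity)
    pending = deque(inputs)
    outputs = []
    cycle = 0
    while len(outputs) < len(inputs):
        if chan.can_recv(cycle):
            outputs.append(2 * chan.recv(cycle))
        if pending and chan.can_send():
            chan.send(pending.popleft(), cycle)
        cycle += 1
    return outputs, cycle


if __name__ == "__main__":
    print(simulate([1, 2, 3, 4], latency=1, capacity=2))
    print(simulate([1, 2, 3, 4], latency=7, capacity=1))
```

Both runs produce the same output values in the same order; only the cycle count differs, which is the freedom a compiler or partitioning tool can exploit.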
[Figure: a single FPGA containing a Timing Partition (Fetch, Decode, Exe), a Functional Partition (Fetch, Decode, Exe), and a Control Partition (Control).]
Latency-Insensitive Design: A Higher Semantic
[Figure: the Timing Partition (Fetch, Decode, Exe), Functional Partition (Fetch, Decode, Exe), and Control Partition (Control), shown with the labels FPGA, FPGA0, and FPGA1.]
Behavior of LI channels does not affect functional correctness.
Latency-Insensitive Design: A Higher Semantic
¨ There are many FIFOs in the design
  ¤ It may not be safe to modify some of them
¨ Compilers see only wires and registers
  ¤ Reasoning about cycle accuracy is difficult
[Figure: a single FPGA containing a Timing Partition (Fetch, Decode, Exe), a Functional Partition (Fetch, Decode, Exe), and a Control Partition (Control).]
But the programmer knows about the LI property…
A Syntax for LI Design
¨ Programmer needs to differentiate LI channels from normal FIFOs
¨ Latency-Insensitive Send/Recv endpoints
  ¤ Implementation chosen by compiler
  ¤ FIFO order
  ¤ Guaranteed delivery
¨ Explicit programmer contract
  ¤ Unspecified buffering & unspecified latency
  ¤ Programmer responsible for correct annotation
module mkTimeP;
   Send#(Inst) send <- mkSend("Decode");
endmodule

module mkFuncP;
   Recv#(Inst) recv <- mkRecv("Decode");
endmodule
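The two modules never reference each other; they are joined only by the shared channel name. A minimal Python sketch of that matching step follows. The class `LIRegistry`, its methods, and the one-sender/one-receiver rule are assumptions for illustration and do not describe the actual compiler; the backing `deque` stands in for whatever implementation a compiler would choose.

```python
from collections import deque


class LIRegistry:
    """Pairs send and recv endpoints that share a channel name."""

    def __init__(self):
        self._queues = {}
        self._senders = set()
        self._receivers = set()

    def _queue(self, name):
        return self._queues.setdefault(name, deque())

    def mk_send(self, name):
        if name in self._senders:
            raise ValueError(f"duplicate send endpoint {name!r}")
        self._senders.add(name)
        return self._queue(name).append  # send(value)

    def mk_recv(self, name):
        if name in self._receivers:
            raise ValueError(f"duplicate recv endpoint {name!r}")
        self._receivers.add(name)
        return self._queue(name).popleft  # recv() returns values in FIFO order

    def unmatched(self):
        """Names that have a sender or a receiver, but not both."""
        return sorted(self._senders ^ self._receivers)


if __name__ == "__main__":
    registry = LIRegistry()
    send = registry.mk_send("Decode")  # as in mkTimeP
    recv = registry.mk_recv("Decode")  # as in mkFuncP
    send("inst0")
    send("inst1")
    print(recv(), recv(), registry.unmatched())
```

Checking `unmatched()` after all modules are built mirrors the programmer contract on the slide: an endpoint whose partner is missing is an annotation error, not a timing problem.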
[Figure: mkTimeP RTL and mkFuncP RTL.]
Easy to use - often a textual substitution!
Connected User Application
[Figure labels: Platform_0, Platform_0 Comms, Platform_1, Platform_1 Comms, mkApplication, mkApplication_Stub, mkA, mkA_Stub, mkB, mkB_Stub, mkC, mkC_Stub. Key: Library, Generated, User.]