Specialized Computing
ECE 5775 (Fall’17): High-Level Digital Design Automation
Announcements
▸ All students enrolled in CMS & Piazza
▸ Vivado HLS tutorial on Tuesday 8/29
– Install an SSH client (MobaXterm or PuTTY)
– % ssh -X <netid>@ecelinux-01.ece.cornell.edu
– Bring your laptop!
Agenda
▸ Motivation for hardware specialization
– Case study on convolution
▸ FPGA introduction
– Basic building blocks
– Basic homogeneous FPGA architecture
– Modern heterogeneous FPGA architecture
Tension between Efficiency and Flexibility
[Figure: “The Dilemma: Flexibility vs. Efficiency” (© 2012 Altera Corporation, Public); labels: MOPS/mW, Programmable Processing]
Source: “High-performance Energy-Efficient Reconfigurable Accelerator Circuits for the Sub-45nm Era,” July 2011, by Ram K. Krishnamurthy, Circuits Research Labs, Intel Corp.
A Simple Single-Cycle Microprocessor
[Figure: single-cycle datapath with PC, Adder, register file (RF), ALU, and data RAM; signals shown include DataA, DataB, IMM, M_address, Data_in, D_in, SE, status flags V/C/Z/N, and control fields DR, SA, SB, MB, FS, MD, LD, MW]
Evaluating a Simple Expression on a CPU
Step-by-step CPU activities
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
Source: Adapted from Desh Singh’s talk at HCP’14 workshop
[Figure: the CPU datapath (PC, RF, ALU, RAM) drawn four times, once per step]
“Unrolling” the CPU Hardware
1. Replicate the CPU hardware
[Figure: four copies of the CPU datapath (PC, RF, ALU, RAM), labeled CPU1 to CPU4, laid out along a “Space” axis, each paired with one instruction]
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
Eliminating Unused Logic
1. Replicate the CPU hardware
2. Instruction fixed -> Remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
[Figure: the replicated datapaths along the “Space” axis with the PC removed; three copies keep RF, ALU, and RAM, and one keeps only RF and ALU]
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
A Special-Purpose Architecture
1. Replicate the CPU hardware
2. Instruction fixed -> Remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
5. Wire up registers and propagate values
[Figure: the resulting dataflow circuit along the “Space” axis, built from registers R0 to R3, three adders (+), two loads (LW), and one store (SW)]
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
The resulting circuit can be realized with either an ASIC or an FPGA
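The contrast between the general-purpose CPU and the special-purpose circuit can be sketched in C. This is only an illustration: the instruction encoding (`struct instr`), the function names, and the four-register file are hypothetical and do not come from the slides. `cpu_run` pays for fetch and decode on every step, while `special_run` is the same computation with the instruction stream fixed and the values wired straight through.

```c
/* Hypothetical three-operation instruction set, just enough for the example. */
enum op { LOAD, ADD, STORE };

struct instr {
    enum op op;
    int rd, ra, rb, imm;   /* destination, two sources, immediate offset */
};

/* General-purpose execution: fetch and decode one instruction per step. */
void cpu_run(const struct instr *prog, int n, int r[4], int mem[])
{
    for (int pc = 0; pc < n; pc++) {            /* PC and fetch */
        const struct instr *in = &prog[pc];
        switch (in->op) {                       /* decode */
        case LOAD:  r[in->rd] = mem[r[in->ra] + in->imm]; break;
        case ADD:   r[in->rd] = r[in->ra] + r[in->rb];    break;
        case STORE: mem[r[in->ra] + in->imm] = r[in->rb]; break;
        }
    }
}

/* The four-instruction program from the slides. */
const struct instr program[4] = {
    { LOAD,  1, 0, 0, 0 },   /* R1 <= M[R0]     */
    { LOAD,  2, 0, 0, 1 },   /* R2 <= M[R0+1]   */
    { ADD,   3, 1, 2, 0 },   /* R3 <= R1 + R2   */
    { STORE, 0, 0, 3, 2 },   /* M[R0+2] <= R3   */
};

/* Special-purpose version: no PC, no fetch, no decode, no unused ALU ops. */
void special_run(int r0, int mem[])
{
    int r1 = mem[r0];        /* LW */
    int r2 = mem[r0 + 1];    /* LW */
    int r3 = r1 + r2;        /* +  */
    mem[r0 + 2] = r3;        /* SW */
}
```

Both functions leave the same value in `mem[r0 + 2]`; the difference is how much machinery is exercised to get there.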
Understanding Energy Inefficiency of General-Purpose Processors (GPPs)
[Figure: Typical Superscalar OoO Pipeline: L1-I$, branch predictor, fetch, decode, rename (RAT, free list), reservation station, schedule, register read/write (Int RF, FP RF), ALU, FPU, D-cache, LSQ + TLB, ROB, commit]
Parameter | Value
Fetch/issue/retire width | 4
# Integer ALUs | 3
# FP ALUs | 2
# ROB entries | 96
# Reservation station entries | 64
L1 I-cache | 32 KB, 8-way set associative
L1 D-cache | 32 KB, 8-way set associative
L2 cache | 6 MB, 8-way set associative
[source: Jason Cong, ISLPED’14 keynote]
Energy Breakdown of Pipeline Components
Fetch unit: 9%
Decode: 6%
Rename: 12%
Register files: 3%
Scheduler: 11%
Int ALU: 14%
Mul/div: 4%
FPU: 8%
Memory: 10%
Misc: 23%
[Figure: the same superscalar OoO pipeline diagram, with a “Control” label]
Removing “Non-Computing” Portions
Fetch unit: 9%
Decode: 6%
Rename: 12%
Register files: 3%
Scheduler: 11%
Int ALU: 14%
Mul/div: 4%
FPU: 8%
Memory: 10%
Misc: 23%
▸ “Computing” portion: 10% (memory) + 26% (compute) = 36%
Energy Comparison of Processor ALUs and Dedicated Units
▸ Why are processor units so expensive?
– ALU can perform multiple operations
• Add/sub/bitwise XOR/OR/AND
– 64-bit ALU
– Dynamic/domino logic used to run at high frequency
• Higher power dissipation
Operation | Processor ALU | 45 nm TSMC standard cell library
32-bit add | 0.122 nJ @ 2 GHz | 0.002 nJ @ 1 GHz
32-bit multiply | 0.120 nJ @ 2 GHz | 0.007 nJ @ 1 GHz
Single-precision FP operation | 0.150 nJ @ 2 GHz | 0.008 nJ @ 500 MHz
Energy Breakdown with Standard-Cell ASICs
Fetch unit: 9%
Decode: 6%
Rename: 12%
Register files: 3%
Scheduler: 11%
Int ALU: 0.2%
Mul/div: 0.2%
FPU: 0.4%
ALU/FPU savings: 25.0%
Memory: 10%
Misc: 23%
▸ “Computing” portion: 10% (memory) + ~1% (compute) = 11%
Only 10X gain is attainable?
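The arithmetic behind these numbers can be checked with a short C sketch. The slides do not state how the 0.2% and 0.4% shares were derived; the sketch assumes each unit's share simply scales with the per-operation energy ratio from the comparison table, which happens to reproduce the figures to rounding. The function names are hypothetical.

```c
/* Share of the original energy budget (in %) left for a unit after its
 * processor implementation is replaced by a dedicated one.
 * Assumption: the share scales linearly with energy per operation. */
double specialized_share(double share_pct, double cpu_nj, double dedicated_nj)
{
    return share_pct * dedicated_nj / cpu_nj;
}

/* Upper bound on the energy gain if only remaining_pct of the original
 * energy is still spent. */
double gain_bound(double remaining_pct)
{
    return 100.0 / remaining_pct;
}
```

With the table's values, `specialized_share(14, 0.122, 0.002)` is about 0.23, `specialized_share(4, 0.120, 0.007)` is about 0.23, and `specialized_share(8, 0.150, 0.008)` is about 0.43, so compute shrinks from 26% to roughly 1%. Keeping only memory plus compute then gives `gain_bound(11)`, about 9X, which is where the "only 10X" question comes from.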
Additional Energy Savings from Specialization
▸ Specialized memory architecture
– Exploit regular memory access patterns to minimize energy per memory read/write
▸ Specialized communication architecture
– Exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network
▸ Customized data type
– Exploit data range information to reduce bitwidth/precision and simplify arithmetic operations
These techniques combined can lead to another 10-100X energy efficiency improvement over GPPs
Case Study: Memory Specialization for Convolution
▸ The main computation of image/video processing is performed over overlapping stencils, termed convolution
3x3 convolution kernel:
-1 -2 -1
 0  0  0
 1  2  1
[Figure: a 3x3 convolution window applied to the input image frame to produce the output image frame; both frames have rows and columns indexed 0 to 6]
Example Application: Edge Detection
▸ Identifies discontinuities in an image where brightness (or image intensity) changes sharply
– Very useful for feature extraction in computer vision
Figures: Pilho Kim, GaTech
Sobel operator G = (G_X, G_Y)
CPU Implementation of Convolution
[Figure: CPU connected to main memory through a cache]
for (n = 1; n < height - 1; n++) for (m = 1; m < width - 1; m++)
  for (i = 0; i < 3; i++) for (j = 0; j < 3; j++)
    /* window centered on (n, m): the -1 offsets keep every access in bounds */
    out[n][m] += img[n+i-1][m+j-1] * f[i][j];
General-Purpose Cache for Convolution
▸ Minimizes main memory accesses to improve performance
▸ A general-purpose cache is expensive and incurs nontrivial energy overhead
[Figure: input picture (W pixels wide)]
Specializing Cache for Convolution (1)
▸ Remove rows that are not in the neighborhood of the convolution window
[Figure: the rows of the W-pixel-wide image that remain around the convolution window]
Specializing Cache for Convolution (2)
Remove the edge pixels that are not needed for computation
▸ Rearrange the rows as a 1D array of pixels
▸ Each time the window moves to the right, push the new pixel into the “cache”
[Figure: the window rows laid out as a 1D array, with the “New Pixel” and “Old Pixel” positions labeled]
A Specialized “Cache”: Line Buffer
▸ Line buffer: a fixed-width “cache” with (K-1)*W+K pixels in flight
– Fixed addressing: low area/power and high performance
▸ In a customized FPGA implementation, line buffers can be efficiently implemented with on-chip BRAMs
[Figure: line buffer of 2W+3 pixels (with K=3), with the “Old Pixel” and “New Pixel” positions labeled]
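A minimal software sketch of the line buffer, assuming a row-major image of `int` pixels and a hypothetical width limit `MAX_W` (neither is specified in the slides). `conv3x3_direct` is the plain nested-loop version for reference; `conv3x3_linebuf` reads each pixel exactly once and keeps only (K-1)*W+K of them. The function names are illustrative only.

```c
#include <assert.h>
#include <string.h>

#define K 3        /* window size, as in the 3x3 example */
#define MAX_W 64   /* hypothetical upper bound on image width for this sketch */

/* Reference: direct 3x3 convolution over the interior of a row-major image. */
void conv3x3_direct(const int *img, int *out, int height, int width,
                    const int f[K][K])
{
    for (int n = 1; n < height - 1; n++)
        for (int m = 1; m < width - 1; m++) {
            int acc = 0;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += img[(n + i - 1) * width + (m + j - 1)] * f[i][j];
            out[n * width + m] = acc;
        }
}

/* Same result using a line buffer of (K-1)*W + K pixels (2W+3 for K = 3). */
void conv3x3_linebuf(const int *img, int *out, int height, int width,
                     const int f[K][K])
{
    int buf[(K - 1) * MAX_W + K];
    int len = (K - 1) * width + K;

    assert(width >= K && width <= MAX_W);
    memset(buf, 0, sizeof buf);

    for (int p = 0; p < height * width; p++) {   /* raster-scan pixel stream */
        int r = p / width, c = p % width;

        /* Shift by one: the oldest pixel falls out, the new pixel enters.
         * Hardware would perform this shift in parallel, not as a copy. */
        memmove(buf, buf + 1, (size_t)(len - 1) * sizeof buf[0]);
        buf[len - 1] = img[p];

        /* Once the window's bottom-right corner (r, c) has reached (2, 2),
         * buf[i * width + j] holds the window pixel at row i, column j. */
        if (r >= K - 1 && c >= K - 1) {
            int acc = 0;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += buf[i * width + j] * f[i][j];
            out[(r - 1) * width + (c - 1)] = acc;
        }
    }
}
```

The window taps sit at fixed offsets in the buffer (`i * width + j`), which is the "fixed addressing" the slide refers to: no tags, no lookups, just a shift and nine constant-position reads per output pixel.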
What is an FPGA?
▸ FPGA: Field-Programmable Gate Array
– An integrated circuit designed to be configured by a customer or a designer after manufacturing (Wikipedia)
▸ Components in an FPGA chip
– Programmable logic blocks
– Programmable interconnects
– Programmable I/Os
Three Important Pieces
Pass transistor (controlled by an SRAM bit)
Multiplexer (controlled by SRAM bits)
Lookup table (LUT, formed by SRAM bits)
▸ SRAM-based implementation is popular
– Non-standard technology means older technology generation
Multiplexer as a Universal Gate
▸ Any function of k variables can be implemented with a 2^k:1 multiplexer
Full-adder truth table:
A B Cin | S Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1
[Figure: 8:1 MUX with data inputs 0 to 7, select inputs S2, S1, S0, and output Cout; the data-input values and select connections are left as question marks]
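One way to fill in the question marks, sketched in C: build the 8:1 multiplexer as a tree of 2:1 multiplexers, tie its data inputs to the Cout column of the full-adder truth table, and drive the select lines with the inputs. The mapping S2 = A, S1 = B, S0 = Cin is an assumption made for this sketch, and the function names are hypothetical.

```c
/* 2:1 multiplexer: returns d1 when sel is 1, otherwise d0. */
static int mux2(int sel, int d0, int d1)
{
    return sel ? d1 : d0;
}

/* 8:1 multiplexer as a three-level tree of 2:1 multiplexers.
 * Selects d[s2*4 + s1*2 + s0]. */
int mux8(const int d[8], int s2, int s1, int s0)
{
    int l0 = mux2(s0, d[0], d[1]);
    int l1 = mux2(s0, d[2], d[3]);
    int l2 = mux2(s0, d[4], d[5]);
    int l3 = mux2(s0, d[6], d[7]);
    int m0 = mux2(s1, l0, l1);
    int m1 = mux2(s1, l2, l3);
    return mux2(s2, m0, m1);
}

/* Full-adder carry-out: the data inputs are the Cout truth-table column,
 * and (A, B, Cin) drive (S2, S1, S0). */
int carry_out(int a, int b, int cin)
{
    static const int cout_column[8] = {0, 0, 0, 1, 0, 1, 1, 1};
    return mux8(cout_column, a, b, cin);
}
```

Swapping in the S column (0, 1, 1, 0, 1, 0, 0, 1) as the data inputs gives the sum bit with the same multiplexer, which is the sense in which the multiplexer is universal.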
How Many Functions?
▸ How many distinct 3-input 1-output Boolean functions exist?
▸ What about K inputs?
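A worked count for the two questions above, offered as a sketch of the standard argument: a K-input function has 2^K truth-table rows, and the output in each row can be chosen independently as 0 or 1.

```latex
\[
  \bigl|\{\, f : \{0,1\}^K \to \{0,1\} \,\}\bigr| = 2^{2^K},
  \qquad
  K = 3:\quad 2^{2^3} = 2^{8} = 256 .
\]
```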
Look-Up Table (LUT)
[Figure: a 3-input LUT: eight SRAM bits (each 0 or 1) feed a MUX with select inputs x2, x1, x0 and output Y]
▸ A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic
– 2^k SRAM bits
– Delay is independent of logic function
How Many LUTs?
▸ How many 3-input LUTs are needed to implement the following full adder?
▸ How about using 4-input LUTs?
A B Cin | Cout S
0 0 0 | 0 0
0 0 1 | 0 1
0 1 0 | 0 1
0 1 1 | 1 0
1 0 0 | 0 1
1 0 1 | 1 0
1 1 0 | 1 0
1 1 1 | 1 1
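A sketch of one answer in C, assuming each LUT produces a single output: the full adder has two outputs, so it takes two 3-input LUTs, one configured with the S column and one with the Cout column. Under the same single-output assumption, 4-input LUTs do not reduce the count (the fourth input is simply unused); devices with multi-output LUTs or dedicated carry logic may do better. The names `lut3`, `LUT_SUM`, and `LUT_COUT` are hypothetical.

```c
#include <stdint.h>

/* 3-input LUT: cfg holds the 8 SRAM bits; bit idx is the output for the
 * input combination idx = x2*4 + x1*2 + x0. */
int lut3(uint8_t cfg, int x2, int x1, int x0)
{
    int idx = (x2 << 2) | (x1 << 1) | x0;
    return (cfg >> idx) & 1;
}

#define LUT_SUM  0x96u   /* S column, rows 111 down to 000: 1001 0110 */
#define LUT_COUT 0xE8u   /* Cout column, rows 111 down to 000: 1110 1000 */

/* Full adder built from two 3-LUTs that share the inputs (A, B, Cin). */
void full_adder(int a, int b, int cin, int *s, int *cout)
{
    *s    = lut3(LUT_SUM,  a, b, cin);
    *cout = lut3(LUT_COUT, a, b, cin);
}
```

Both LUTs evaluate in the same way regardless of which function they hold, which is why the LUT delay is independent of the logic function.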
A Logic Element
[Figure: a LUT followed by a flip-flop]
▸ A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed
▸ The LUT and FF combined form a logic element
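A small behavioral model of a logic element in C, as a sketch only: the struct layout, the 3-input LUT size, and the function names are assumptions made for illustration, not a description of any particular device.

```c
#include <stdint.h>

/* Hypothetical logic element: a 3-input LUT plus a bypassable flip-flop. */
struct logic_element {
    uint8_t lut_cfg;     /* 8 SRAM bits configuring the LUT */
    int     bypass_ff;   /* 1: output comes straight from the LUT */
    int     ff;          /* current flip-flop state */
};

/* Combinational LUT output for the given inputs. */
static int le_lut(const struct logic_element *le, int x2, int x1, int x0)
{
    return (le->lut_cfg >> ((x2 << 2) | (x1 << 1) | x0)) & 1;
}

/* Output of the element: the LUT value if the FF is bypassed,
 * otherwise the value stored in the FF. */
int le_output(const struct logic_element *le, int x2, int x1, int x0)
{
    return le->bypass_ff ? le_lut(le, x2, x1, x0) : le->ff;
}

/* Clock edge: the flip-flop captures the current LUT output. */
void le_clock(struct logic_element *le, int x2, int x1, int x0)
{
    le->ff = le_lut(le, x2, x1, x0);
}
```

With the FF in the path, the output only changes on `le_clock`, which is what lets logic elements be chained into pipelined or sequential circuits.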
A Logic Block
▸ A logic block clusters multiple logic elements
▸ Example: In Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices
– Two independent carry chains per CLB for implementing adders
– Each slice contains four LUTs
[Figure: a CLB with two slices, each with CIN and COUT carry ports, attached to a crossbar switch]
Traditional Homogeneous FPGA Architecture
[Figure: homogeneous FPGA fabric with labeled logic blocks, switch blocks, and routing tracks]
Modern Heterogeneous Field-Programmable System-on-Chip
[Figure credit: embeddedrelated.com]
▸ Island-style configurable mesh routing
▸ Lots of dedicated components
– Memories/multipliers, I/Os, processors
– Specialization leads to higher performance and lower power
Dedicated DSP Blocks
▸ Built-in components for fast arithmetic operations, optimized for DSP applications
– Essentially a multiply-accumulate core with many other features
– Fixed logic and connections; functionality may be configured using control signals at run time
– Much faster than LUT-based implementation (ASIC vs. LUT)
Example: Xilinx DSP48E Slice
▸ 25x18 signed multiplier
▸ 48-bit add/subtract/accumulate
▸ 48-bit logic operations
▸ SIMD operations (12/24 bit)
▸ Pipeline registers for high speed
[source: Xilinx Inc.]
Finite Impulse Response (FIR) Filter Mapped to DSP Slices
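[Figure: cascaded DSP slices with coefficients C0 to C3, input x(n), a cascade input of 0, and output y(n); bus widths of 18 and 38 are labeled]
$y[n] = \sum_{i=0}^{N} c_i \, x[n-i]$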
[source: Xilinx Inc.]
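The FIR sum can be sketched in C as a chain of multiply-accumulate steps, one per coefficient, mirroring the one-tap-per-DSP-slice mapping in the figure. Assumptions in this sketch: four taps (so N = 3 in the formula), samples before x[0] treated as zero, and a 64-bit accumulator standing in for the DSP slice's 48-bit one. The name `fir_sample` is hypothetical.

```c
#include <stdint.h>

#define TAPS 4   /* four coefficients C0..C3, as in the figure */

/* One output sample: y[n] = sum over i of c[i] * x[n - i],
 * with x[k] taken as 0 for k < 0. */
int64_t fir_sample(const int32_t c[TAPS], const int32_t *x, int n)
{
    int64_t acc = 0;                          /* the cascade starts from 0 */
    for (int i = 0; i < TAPS; i++)
        if (n - i >= 0)
            acc += (int64_t)c[i] * x[n - i];  /* one multiply-accumulate per tap */
    return acc;
}
```

In software the taps run one after another; on the FPGA each multiply-accumulate sits in its own DSP slice, so all taps work in the same clock cycle on different samples.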
Dedicated Block RAMs (BRAMs)
▸ Example: Xilinx 18K/36K block RAMs
– 32k x 1 to 512 x 72 in one 36K block
– Simple dual-port and true dual-port configurations
– Built-in FIFO logic
– 64-bit error correction coding per 36K block
[Figure: 18K/36K block RAM with two ports; port A signals DIA, DIPA, ADDRA, WEA, ENA, CLKA, DOA, DOPA and port B signals DIB, DIPB, ADDRB, WEB, ENB, CLKB, DOB, DOPB]
[source: Xilinx Inc.]
Embedded FPGA System-on-Chip
Xilinx Zynq All Programmable System-on-Chip [source: Xilinx Inc.]
Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz to 1 GHz
Up to 350K logic cells
2 MB block RAM
900 DSP48s
FPGA as an Accelerator for Cloud Computing
▸ Massive amount of fine-grained parallelism
▸ Silicon configurable to fit the application
▸ Performance/watt advantage over CPUs & GPUs
AWS Cloud F1 FPGA instance: Xilinx UltraScale+ VU9P
~2 million logic blocks
~5000 DSP blocks
~300 Mb block RAM
[Figure source: David Pellerin, AWS]
Microsoft Deploying FPGAs in Datacenter
[source: Microsoft, BrainWave, HotChips’2017]
▸ FPGAs deployed in Microsoft datacenters to accelerate various web, database, and AI services
– e.g., Project BrainWave claimed ~40 teraflops on large recurrent neural networks using Intel Stratix 10 FPGAs
Summary: FPGA as a Programmable Accelerator
▸ Massive amount of fine-grained parallelism
– Highly parallel and/or deeply pipelined to achieve maximum parallelism
– Distributed data/control dispatch
▸ Silicon configurable to fit algorithm
– Compute the exact algorithm at the desired level of numerical accuracy
• Bit-level sizing and sub-cycle chaining
– Customized memory hierarchy
▸ Performance/watt advantage
– Low power consumption compared to CPUs and GPGPUs
• Low clock speed
• Specialized architecture blocks
Next Class
▸ Vivado HLS tutorial (led by “TAs”)
Acknowledgements
▸ These slides contain/adapt materials developed by
– Prof. Jason Cong (UCLA)