From Software to Circuits: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems
Jason Anderson
Tools to Tackle Big Data – Big Data Workshop, 3 July 2014
Dept. of Electrical and Computer Engineering, University of Toronto
LegUp Research Team
• Undergrad Researchers: Mathew Hall, Stefan Hadjis, Joy Chen
• Faculty: Stephen Brown and myself
• Industry Liaison: Tomasz Czajkowski, Altera
Andrew Canis, James Choi, Nazanin Calagar, Lanny Lian, Blair Fort
Computations in Two Ways
• Write Software
• Design Custom Circuits
Design Methodology
Write software
• Easy
• Flexible, but lower performance
Design Custom Circuits
• Efficient, low power
• Need specialized knowledge
Hardware’s Potential
• Implementing computations in FPGA hardware can have speed/energy advantages over software:
– Lithography simulation: 15X speed-up [Cong & Zou, TRETS’09]
– Linear system solver: 2.2X speed-up, 5X more energy efficient [Zhang, Betz, Rose, TRETS’12]
– Monte Carlo simulation for photodynamic therapy: 80X faster, 45X more energy efficient [Lo et al., J. Biomed Optics’09]
– Options pricing: 4.6X faster, 25X more energy efficient [Tse, Thomas, Luk, TVLSI’12]
So Why Doesn’t Everybody Use Hardware?
• Hardware design is difficult and skills are rare:
– Requires use of hardware description languages: Verilog and VHDL
– Low level of abstraction (individual bits)
– 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers
*US Bureau of Labour Statistics 2012
A Solution
• High-Level Synthesis
– Design circuits using software languages
– From a software program, a high-level synthesis tool automatically “synthesizes” a circuit that does the same computations as the program
– Benefits of software programmability and hardware performance
LegUp High-Level Synthesis for FPGAs
• LegUp is a high-level synthesis tool we have been developing since 2009.
• Takes a C program as input, and produces a circuit.
• 1000+ downloads of our tool since its first release in 2011.
• http://legup.eecg.toronto.edu
Why Use FPGAs to Implement Circuits?
• Building fully fabricated custom chips is hard
– Very complex design process
– Costs $millions to prototype a chip
– Takes 2-3 months to fabricate
– Only done for high-volume applications or apps that require high speed or lowest power
• Alternative: pre-fabricated, programmable chips
Field-Programmable Gate Arrays (FPGAs)
• Pre-fabricated chip consisting of an “array” of configurable logic blocks (CLBs) surrounded by channels of programmable interconnect
• Hardware “becomes” what you want by programming the blocks and interconnect (electrically)
[Figure: array of CLBs with channels of programmable interconnect, SRAM blocks (e.g., 18 kbits), and hard IP blocks. Common hard blocks: multiplier, DSP, processor, PCI, ADC, DLL.]
A Real FPGA – Altera Stratix III
FPGA Advantages over “Hard” Chips
• “Manufacture” takes seconds vs. months
• Design, test and manufacture: $single-digit millions vs. $tens of millions
• Giving:
– Faster time-to-market for products
– FPGA vendor handles difficult design & manufacture issues
– FPGA vendor shares inventory risk across many customers
– FPGA vendor does test
• Two largest FPGA vendors: Xilinx and Altera
FPGAs and High-Level Synthesis
• FPGAs mainly accessible to HW engineers
– Vendors want to expand the user-base: make FPGAs usable as computing platforms
• Area/power/delay gap between HLS-generated HW and manually crafted HW
– In custom Si, the user must “pay” for the area gap
– Power/performance is one of the main reasons to go custom
• FPGAs likely the IC medium through which HLS goes “mainstream”
LegUp: Top-Level Vision
[Flow diagram: program code is compiled by a C compiler for a self-profiling processor (MIPS/ARM) on the FPGA fabric. Profiling data (execution cycles, power, cache misses) suggests program segments to target to HW; high-level synthesis hardens those segments into accelerators, and the SW binary is altered to call the HW accelerators.]
Example program segment:
  int FIR(int ntaps, int sum) {
      int i;
      for (i = 0; i < ntaps; i++)
          sum += h[i] * z[i];
      return sum;
  }
LegUp: Key Features
• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• Automated verification tests
• Support for four different FPGAs:
– Altera Cyclone II, Stratix IV, Cyclone IV, Cyclone V-SoC
• Open source, freely downloadable
How Does High-Level Synthesis Work?
Digital Circuits
• Example: you buy a “1 GHz processor”
– 1 GHz = 1 nanosecond time-steps
– Some computation is done in each time step
Example Circuit
• Computes (A+B)*(C–D) – (E*F) over three 1 ns time steps:
– Step 1: A+B, C–D and E*F computed in parallel
– Step 2: multiply (A+B) by (C–D)
– Step 3: subtract E*F from the step-2 product
• The result of each step is stored before the next step begins
Scheduling: Key Aspect of HLS
• How to assign the computations of a program to the hardware time steps?
C language snippet:
  z = a+b;
  x = c+d;
  q = z+x;
  q = q-2;
  r = q*2;
Programs do not contain the notion of “time steps”. Here, we have:
  3 add operations
  1 subtract operation
  1 multiplication operation
Scheduling
Questions:
• Which operations can be scheduled in the same time step?
• Which operations are dependent on others?
• If addition takes 5ns, subtraction takes 5ns and multiplication takes 10ns, how to schedule?
– Target clock step length is 10ns
C language snippet:
  z = a+b;
  x = c+d;
  q = z+x;
  q = q-2;
  r = q*2;
Scheduling
[Schedule diagram, three 10 ns steps: step 1 computes a+b and c+d as parallel operations; step 2 chains the 5 ns add z+x and the 5 ns subtract q-2 within one 10 ns step (chaining); step 3 performs the 10 ns multiply q*2.]
HLS Challenges
• Performance of HLS-generated circuits not as good as human-designed circuits
• However, HLS-generated circuits are already better than SW in many cases
• Much of our research is aimed towards improving HLS quality
Loop Pipelining
  for (int i = 0; i < N; i++) {
      sum[i] = a + b + c + d;
  }
[Unpipelined schedule: the three additions of each iteration execute in cycles 1, 2 and 3, one adder active per cycle.]
• Cycles: 3N
• Adders: 3
• Utilization: 33%
Loop Pipelining
  Cycle:  1  2  3  4  5  …  N  N+1  N+2
  i=0     +  +  +
  i=1        +  +  +
  i=2           +  +  +
  …               …  …  …
  i=N-1                     +   +    +
• Cycles: N+2 (~1 cycle per iteration)
• Adders: 3
• Utilization: 100% in steady state
Loop Pipelining
• Ideally, we could start a loop iteration every clock cycle
– Initiation interval (II) = 1
• However,
– Loops may have dependencies across iterations
– There may be constraints on resources
• e.g. only two memory accesses in a cycle
• Loop pipelining seeks to minimize II subject to constraints
Exploiting Spatial Parallelism
Motivation
• Speed benefits of HW arise from spatial parallelism
• Extracting parallelism from a sequential program is difficult
• Auto-parallelizing compilers do not work well!
• Easier to start from parallel code
• Pthreads/OpenMP can help!
Background
Programming Models
• Sequential: C/C++
• Massively Parallel: CUDA/OpenCL
• In between: Pthreads/OpenMP – Standard API in C!
OpenMP example
  #pragma omp parallel for num_threads(2) private(i)
  for (i = 0; i < SIZE; i++) {
      output[i] = A_array[i] * B_array[i];
  }
Pthread Example
  #include <pthread.h>

  struct thread_data {
      int start;
      int end;
  };

  void *product(void *threadarg);

  int main() {
      pthread_t thread1, thread2;
      struct thread_data data1, data2;

      data1.start = 0;        data1.end = SIZE/2;
      data2.start = SIZE/2;   data2.end = SIZE;

      pthread_create(&thread1, NULL, product, (void*)&data1);
      pthread_create(&thread2, NULL, product, (void*)&data2);

      pthread_join(thread1, NULL);
      pthread_join(thread2, NULL);
  }

  void *product(void *threadarg) {
      int i, startIdx, endIdx;
      struct thread_data* arg = (struct thread_data*) threadarg;
      startIdx = arg->start;
      endIdx = arg->end;
      for (i = startIdx; i < endIdx; i++) {
          output[i] = A_array[i] * B_array[i];
      }
      return NULL;
  }
Pthreads vs OpenMP
• OpenMP provides an easy/implicit way of parallelizing a section of code (e.g. loops)
• Pthreads require explicit thread forks/joins
• Pthreads can be more work but give more control to the programmer
• Pthreads can execute different functions in parallel
OpenMP/Pthreads Support in LegUp
• Allow Pthreads and OpenMP to be used to specify parallel hardware.
• Automatically infer parallel-operating accelerators for the parallel-operating threads.
• Permits an easy exploration of a broad parallelization landscape.
– Incl. support for nested parallelism.
Nested Parallelism
[Diagram: Pthreads fork three worker threads (add, sub, mult); each worker in turn opens a nested OpenMP parallel region (OMP).]
Nested Parallelism
[Architecture diagrams: a processor and accelerators Accel 1–3 share an on-chip cache, backed by off-chip memory; with nested parallelism, the on-chip cache is multi-ported so that the processor and the sub-accelerators (1, 2, 3 under each accelerator) can access it concurrently.]
Case Study
Computing the Mandelbrot Set
• Highly compute-bound application
• Each pixel is computed independently
• Fixed point calculations
• Our target image: 128x128 pixels
Target Platform: Altera Cyclone V-SoC
• 28nm FPGA with embedded dual-core ARM processor in the FPGA fabric
– 800 MHz ARM with L1 + L2 caches
– FPGA accelerators can access the ARM cache
[Diagram: ARM processor alongside the Altera Cyclone V FPGA fabric on a single chip.]
Speed Performance Results
[Bar chart: wall-clock time (ms) and clock frequency (MHz) for ARM software and for 1, 2, 4 and 8 HLS accelerators; with 8 accelerators, a 5.7X speed-up vs the ARM SW.]
High-Level Synthesis for Big Data
• Seeking big data applications we can collaborate on and accelerate with HLS
• Ideal characteristics:
– Compute bound (not I/O bound)
– Integer or fixed point (not floating point)
– Data parallel
• Please reach out to us
Summary
• LegUp is an open-source high-level synthesis tool being developed at Univ. of Toronto.
– Targets a hybrid FPGA-based processor/accelerator system.
– Distribution includes many benchmark programs and other infrastructure.
• Active development continues.
– Pthreads + OpenMP, debugging, memory architecture synthesis, improved HW quality.
Questions?
legup.eecg.toronto.edu