Higher Level Programming Abstractions for FPGAs
Transcript of Higher Level Programming Abstractions for FPGAs
© 2011 Altera Corporation
Higher Level Programming Abstractions for FPGAs using OpenCL
Desh Singh
Supervising Principal Engineer
Altera Corporation
Toronto Technology Center
Programmable Solutions: 1985-2002
[Diagram: CPUs and DSPs as single cores; FPGAs as fine-grained arrays]
Technology scaling favors programmability
Programmable Solutions: 2002-20XX
[Diagram: CPUs and DSPs moving from single cores to multi-cores and coarse-grained massively parallel processor arrays; FPGAs as fine-grained massively parallel arrays]
Technology scaling favors programmability and parallelism
Programmable and Sequential
[Diagram: a CPU executing a program as a stream of instructions]
After four decades of success… reaching the limit
“The End of Denial Architecture”
- William J. Dally [DAC’2009]
Single thread processors are in denial about parallelism and locality
They provide two illusions:
Serial execution: denies parallelism
Tries to exploit parallelism with ILP, which has limited scalability
Flat memory: denies locality
Tries to provide the illusion with caches, which are very inefficient when the working set doesn't fit in the cache
Programmable and Parallel
[Diagram: multiple processor + memory pairs connected to a shared external memory]
Exploit parallelism on a chip
Take advantage of Moore's law
Processors not getting faster, just wider
Keep the power consumption down
Use more transistors for information processing
FPGA: Ultimately Configurable Multicore
Many coarse-grained processors, each with local memory
Different implementation options: a small soft scalar processor, or a larger vector processor, or a customized hardware pipeline
Each processor can exploit the fine-grained parallelism of the FPGA to more efficiently implement its "program"
Possibly heterogeneous: optimized for different tasks
Customizable to suit the needs of a particular application
Processor Possibilities
[Diagram: a spectrum from a traditional microprocessor to dedicated RTL circuitry:
a scalar soft processor (Nios II with program/data memories and arbiters) plus a C2H accelerator;
a VLIW / vector / TTA soft processor (integer, float, and Boolean register files; integer and float ALUs; load/store, immediate, and instruction units; instruction and data memories);
a custom pipeline of dedicated RTL circuitry]
MOTIVATING CASE STUDY: FINITE IMPULSE RESPONSE (FIR) FILTER
FIR Filter Example
[Figure: FIR filter structure, TAPS (N=7)]
Custom Multithreaded Pipeline
A throughput of 1 thread per cycle is possible using a direct HW implementation
The FPGA offers custom pipeline parallelism which can be perfectly tailored to the FIR filter
[Pipeline diagram, clock cycles vs. threads 0-4, one output per thread:
y[0] = Σ_i h(i)·x(0−i)
y[1] = Σ_i h(i)·x(1−i)
y[2] = Σ_i h(i)·x(2−i)
y[3] = Σ_i h(i)·x(3−i)
y[4] = Σ_i h(i)·x(4−i)]
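The FIR computation in this example, y[k] = Σ_i h(i)·x(k−i), can be sketched in plain Python (an illustrative model only, not the hardware pipeline itself): each output y[k] is an independent sum over the N taps, which is exactly the per-output parallelism a custom pipeline exploits, one "thread" per output.

```python
def fir(h, x):
    """Direct-form FIR filter: y[k] = sum_i h[i] * x[k - i].

    Each output y[k] depends only on the inputs and the coefficients,
    so all outputs can in principle be computed in parallel --
    one thread per output, as in the pipeline diagram above.
    """
    y = []
    for k in range(len(x)):
        acc = 0
        for i in range(len(h)):
            if 0 <= k - i < len(x):   # treat samples before x[0] as zero
                acc += h[i] * x[k - i]
        y.append(acc)
    return y

# 3-tap moving-sum filter applied to a short input
print(fir([1, 1, 1], [1, 2, 3, 4]))  # [1, 3, 6, 9]
```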
Absolute Performance Comparison (FIR)
[Chart: performance in megasamples/second (0-180) vs. TAPS (N) for CPU, GPU, and FPGA
*Filter too large to fit without serialization]
Performance on Conventional Processors
A straightforward loop implementation requires O(N) operations for every value produced
Costly, as N can be much larger than the number of parallel functional units present on most processors
A technique from signal processing theory offers higher throughput, requiring only O(log N) operations per value
[Diagram: FFT-based filtering. x(n) and h(n) each pass through an N-point FFT, the products Xm(k)·H(k) are formed, and an N-point IFFT produces the output]
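The O(log N) technique is frequency-domain filtering: transform blocks of the signal and the filter with an FFT, multiply pointwise, and inverse-transform. A minimal self-contained sketch (the recursive radix-2 FFT below is written out for illustration; real code would use an optimized library), checked against direct convolution:

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    sign = 1 if invert else -1
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_filter(h, x):
    """Filter x with FIR taps h by pointwise multiplication in the
    frequency domain: O(log N) work per output value."""
    m = len(x) + len(h) - 1          # length of the linear convolution
    n = 1
    while n < m:                     # pad to the next power of 2
        n *= 2
    H = fft(list(h) + [0.0] * (n - len(h)))
    X = fft(list(x) + [0.0] * (n - len(x)))
    Y = [Hk * Xk for Hk, Xk in zip(H, X)]
    y = fft(Y, invert=True)          # unscaled inverse FFT
    return [v.real / n for v in y[:m]]

def direct_filter(h, x):
    """O(N)-per-output reference: y[k] = sum_i h[i] * x[k - i]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k in range(len(y)):
        for i in range(len(h)):
            if 0 <= k - i < len(x):
                y[k] += h[i] * x[k - i]
    return y

h, x = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_filter(h, x), direct_filter(h, x)))
```

The two paths agree; the FFT version simply trades the per-output O(N) inner loop for a logarithmic number of butterfly stages per block.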
Power Comparison (FIR)
[Chart: power in Watts (0-80) vs. TAPS (N) for CPU, GPU, and FPGA]
Performance-to-Power Ratio (FIR)
[Chart: performance-to-power ratio in megasamples/second/W (0-350) vs. TAPS (N) for CPU, GPU, and FPGA]
FPGAs for Computation
Although the FIR filter is a simple example, it is representative of a large class of applications:
Large amounts of spatial locality
Computation can be expressed as a feed-forward pipeline
The fine-grained parallelism of the FPGA can be used to create custom "processors" that are orders of magnitude more efficient than CPUs or GPUs
A system may have many other types of "processors" which may be customized in a similar manner
CHALLENGES
Software Programmer's View
Programmers are used to software-like environments:
Ideas can easily be expressed in languages such as 'C'
Typically start with a simple sequential program
Use parallel APIs / language extensions to exploit multi-core for additional performance
Compilation times are almost instantaneous
Immediate feedback and rich debugging tools
[Diagram: a main(…) { for( … ) { … } } program passing through a compiler]
FPGA Hardware Design
[Diagrams: state machines (Idle → Request → Ack → Do Useful Work → Done), datapaths, and an SoC interconnect joining bridges, a PCIe core, and a memory controller and PHY across multiple clock domains (125 / 250 / 200 / 400 MHz)]
Design Entry Complexity
These circuits are described in hardware description languages such as VHDL or Verilog
Incredibly detailed design must be done before a first working version is possible:
Cycle-by-cycle behavior must be specified for every register in the design
The complete flexibility of the FPGA means that the designer needs to specify all aspects of the hardware circuit
Buffering, arbitration, IP core interfacing, etc.
FPGA CAD / Compilation is Complex
Sophisticated optimization algorithms are used in each step, leading to significantly longer runtimes than a software compile (hours vs. minutes)
[Flow: Synthesis → Technology Mapping → Clustering → Placement → Routing, mapping logic onto the FPGA's logic blocks]
Timing Closure Problems
Designers will often have to go through numerous
iterations to meet timing requirements
[Flow: HDL → Synthesis, Tech Map → Cluster → Place, Route → Timing Analysis; if timing is not met, a design change feeds back to the start]
Design Scalability
Using RTL design entry, there is significant work in porting applications from generation to generation of FPGA technology
[Diagram: today's FPGA with a few processor + memory pairs vs. a future FPGA with ever-increasing logic capacity and many more]
Ideally, a 2x improvement in logic capacity should translate into 2x performance
In addition to doubling the datapath, control logic and SoC interconnect need to change as well
Future Proofing
To reap the benefits of Moore's law, designs must be described in such a way that they can take advantage of larger logic capacities
HDL design descriptions make this difficult:
Generally tweaked and optimized for a particular device rather than parameterized for the general case
Hard to express abstractions which allow for general parallel scaling
Cycle-by-cycle behavior may need to be altered slightly as cores scale
E.g., interconnection logic may need additional pipeline registers to sustain throughput requirements
Fundamental challenges
Implementing an algorithm on an FPGA is done by designing hardware:
Difficult to design, verify, and code for scalable performance
Generally, software programmers will have difficulty using FPGAs as massive multi-core devices to accelerate parallel applications
We need a programming model that allows the designer to think about the FPGA as a configurable multi-core device
An ideal programming environment …
Has the following characteristics:
Based on a standard multicore programming model rather than something which is FPGA-specific
Abstracts away the underlying details of the hardware
VHDL / Verilog are similar to "assembly language" programming: useful in rare circumstances where the highest possible efficiency is needed
The price of abstraction is not too high
Still need to efficiently use the FPGA's resources to achieve high throughput / low area
Allows for software-like compilation & debug cycles
Faster compile times
Profiling & user feedback
OPENCL: BRIEF INTRODUCTION
What is OpenCL?
OpenCL is a programming model developed by the Khronos Group to support silicon acceleration
An industry consortium creating open API standards
Enables software to leverage silicon acceleration
Commitment to royalty-free standards: making money from enabled products, not from the standards themselves
OpenCL Driving Forces
Attempt at driving an industry standard parallel language across platforms
CPUs, GPUs, Cell Processors, DSPs, and even FPGAs
So far driven by applications from:
The consumer space
Image Processing & Video Encoding
1080p video processing on mobile devices
Augmented reality & Computational Photography
Game programming
More sophisticated rendering algorithms
Scientific / High Performance Computing
Financial, Molecular Dynamics, Bioinformatics, etc.
OpenCL
OpenCL is a parallel language that provides us with two distinct advantages
Parallelism is declared by the programmer
Data parallelism is expressed through the notion of parallel threads which are instances of computational kernels
Task parallelism is accomplished with the use of queues and events that allow us to coordinate the coarse-grained control flow
Data storage and movement is explicit
Hierarchical Memory model
Registers
Accelerator Local Memory
Global off-chip memory
It is up to the programmer to manage their memories and bandwidth efficiently
OpenCL Structure
Natural separation between the code that runs on accelerators* and the code that manages those accelerators
The management or "host" code is pure software that can be executed on any sort of conventional microprocessor
Soft processor, embedded hard processor, external x86 processor
The kernel code is 'C' with a minimal set of extensions that allows for the specification of parallelism and memory hierarchy
Likely only a small fraction of the total code in the application
Used only for the most computationally intensive portions
* Accelerator = processor + memory combo
OpenCL Host Program
Pure software written in standard 'C'
Communicates with the accelerator device via a set of library routines which abstract the communication between the host processor and the kernels
main() {
    read_data_from_file( … );
    manipulate_data( … );

    clEnqueueWriteBuffer( … );        // copy data from host to FPGA
    clEnqueueTask( …, my_kernel, … ); // ask the FPGA to run a particular kernel
    clEnqueueReadBuffer( … );         // copy data from FPGA to host

    display_result_to_user( … );
}
OpenCL Kernels
A data-parallel function that defines many parallel threads of execution
Each thread has an identifier specified by get_global_id
Contains keyword extensions to specify parallelism and memory hierarchy
Executed by a compute object: CPU, GPU, or accelerator

__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *answer)
{
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
}

Example: a = {0, 1, 2, 3, 4, 5, 6, 7}, b = {7, 6, 5, 4, 3, 2, 1, 0},
sum( … ) yields answer = {7, 7, 7, 7, 7, 7, 7, 7}
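The kernel's semantics can be mimicked in plain Python to show what the runtime does conceptually: launch one logical thread per element, each identified by its global ID. This is only a sequential emulation of the data-parallel model, not how an OpenCL device actually executes.

```python
def sum_kernel(a, b, answer, xid):
    # Body of the OpenCL kernel above; xid plays the role of get_global_id(0)
    answer[xid] = a[xid] + b[xid]

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [7, 6, 5, 4, 3, 2, 1, 0]
answer = [0] * 8

# The runtime conceptually launches one thread per global ID;
# here we simply iterate, since every thread is independent.
for xid in range(len(a)):
    sum_kernel(a, b, answer, xid)

print(answer)  # [7, 7, 7, 7, 7, 7, 7, 7]
```

Because no thread reads another thread's output, the iteration order is irrelevant, which is exactly what lets the device run the threads in parallel.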
Altera-supported system configurations
External host: a host CPU runs the user program and OpenCL runtime, with the accelerators on the FPGA
Embedded: an embedded CPU on the FPGA runs the user program and OpenCL runtime alongside the accelerators
OpenCL FPGA Target
[Diagram: kernels become computation datapath accelerators, each with local memory; the host program runs on an embedded soft or hard processor or over PCIe; SOPC / QSYS connects the accelerators with an external memory controller & PHY (DDR* interface IP) and application-specific external protocols]
OpenCL to FPGA Challenges
OpenCL's compute model targets an "abstract machine" that is not an FPGA:
A hierarchical array of processing elements
A corresponding hierarchical memory structure
It is more difficult to target an OpenCL program to an FPGA than to more natural hardware platforms such as CPUs and GPUs
These problems are research opportunities
Compilation Flow
[Flow diagram:
Kernel path: vectorAdd_kernel.cl → CLANG front end → unoptimized LLVM IR → Optimizer → optimized LLVM IR → RTL generator → Verilog → QSYS (system description) → Quartus
Host path: vectorAdd_host.c → C compiler + ACL runtime library → program.elf
The LLVM IR stages are open to third-party or academic tools]
OpenCL to FPGA Advantages
Standard multi-core programming model:
OpenCL is supported by a consortium of companies trying to drive a portable parallel programming model
Abstracts away underlying HW details:
OpenCL is based on standard 'C' with a few extensions
Cycle-by-cycle behavior does not need to be specified
Eases coding of scalable solutions:
Application parallelism is specified by the programmer
Compiler & runtime routines distribute the workloads depending on the characteristics of the accelerator
OpenCL to FPGA Advantages
Addresses long compile times:
Timing closure can be handled by the compilation flow
Registers can be inserted to shorten critical paths, and control logic is automatically adjusted
A large class of design changes can be handled instantaneously
The OpenCL "host" program is pure software that runs on a standard microprocessor
This code is compiled using a standard 'C' compiler: gcc, msvc++, etc.
Summary
OpenCL is a standard multi-core programming
model that can be used to provide a higher-level
layer of abstraction for FPGAs
Research challenges abound:
Need to collaborate with academics, third parties, and members of the Khronos Group
Require libraries, kernel compilers, debugging tools, pre-defined templates, etc.
We have to consider that our "competition" is no longer just other FPGA vendors
A broad spectrum of programmable multi-core devices targeting different market segments
© 2011 Altera Corporation—Confidential
ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation
and registered in the United States and are trademarks or registered trademarks in other countries.
Thank You