Higher Level Programming Abstractions for FPGAs
Transcript of Higher Level Programming Abstractions for FPGAs
© 2011 Altera Corporation
Higher Level Programming Abstractions for FPGAs using OpenCL
Desh Singh
Supervising Principal Engineer
Altera Corporation
Toronto Technology Center
Programmable Solutions: 1985-2002
[Diagram: CPUs and DSPs as single cores; FPGAs as fine-grained arrays]
Technology scaling favors programmability
Programmable Solutions: 2002-20XX
[Diagram: CPUs and DSPs moving from single cores to multi-cores and coarse-grained massively parallel processor arrays; FPGAs as fine-grained massively parallel arrays]
Technology scaling favors programmability and parallelism
Programmable and Sequential
[Diagram: a CPU executing a program as a stream of instructions]
After four decades of success… reaching the limit
“The End of Denial Architecture”
- William J. Dally [DAC’2009]
Single thread processors are in denial about parallelism and locality
They provide two illusions:
Serial execution: denies parallelism
Tries to exploit parallelism with ILP, which has limited scalability
Flat memory: denies locality
Tries to provide the illusion with caches, which are very inefficient when the working set doesn't fit in the cache
Programmable and Parallel
[Diagram: multiple processor + memory pairs connected to a shared external memory]
Exploit parallelism on a chip
Take advantage of Moore's law
Processors not getting faster, just wider
Keep the power consumption down
Use more transistors for information processing
FPGA: Ultimately Configurable Multicore
Many coarse-grained processors, each with local memory
Different implementation options: a small soft scalar processor, or a larger vector processor, or a customized hardware pipeline
Each processor can exploit the fine-grained parallelism of the FPGA to more efficiently implement its "program"
Possibly heterogeneous: optimized for different tasks
Customizable to suit the needs of a particular application
Processor Possibilities
[Diagram: a spectrum from a traditional microprocessor to dedicated RTL circuitry:
a scalar soft processor (Nios II with program/data memories and arbiters) plus a C2H accelerator;
a VLIW / vector / TTA soft processor (integer, float, and Boolean register files; integer and float ALUs; load/store, immediate, and instruction units; instruction and data memories);
a custom pipeline of dedicated RTL circuitry]
MOTIVATING CASE STUDY: FINITE IMPULSE RESPONSE (FIR) FILTER
FIR Filter Example
[Figure: FIR filter structure, TAPS (N=7)]
Custom Multithreaded Pipeline
A throughput of 1 thread per cycle is possible using a direct HW implementation
The FPGA offers custom pipeline parallelism which can be perfectly tailored to the FIR filter
[Pipeline diagram, clock cycles vs. threads 0-4, one output per thread:
y[0] = Σ_i h(i)·x(0−i)
y[1] = Σ_i h(i)·x(1−i)
y[2] = Σ_i h(i)·x(2−i)
y[3] = Σ_i h(i)·x(3−i)
y[4] = Σ_i h(i)·x(4−i)]
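The FIR computation in this example, y[k] = Σ_i h(i)·x(k−i), can be sketched in plain Python (an illustrative model only, not the hardware pipeline itself): each output y[k] is an independent sum over the N taps, which is exactly the per-output parallelism a custom pipeline exploits, one "thread" per output.

```python
def fir(h, x):
    """Direct-form FIR filter: y[k] = sum_i h[i] * x[k - i].

    Each output y[k] depends only on the inputs and the coefficients,
    so all outputs can in principle be computed in parallel --
    one thread per output, as in the pipeline diagram above.
    """
    y = []
    for k in range(len(x)):
        acc = 0
        for i in range(len(h)):
            if 0 <= k - i < len(x):   # treat samples before x[0] as zero
                acc += h[i] * x[k - i]
        y.append(acc)
    return y

# 3-tap moving-sum filter applied to a short input
print(fir([1, 1, 1], [1, 2, 3, 4]))  # [1, 3, 6, 9]
```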
Absolute Performance Comparison (FIR)
[Chart: performance in megasamples/second (0-180) vs. TAPS (N) for CPU, GPU, and FPGA
*Filter too large to fit without serialization]
Performance on Conventional Processors
A straightforward loop implementation requires O(N) operations for every value produced
Costly, as N can be much larger than the number of parallel functional units present on most processors
A technique from signal processing theory offers higher throughput, requiring only O(log N) operations per value
[Diagram: FFT-based filtering. x(n) and h(n) each pass through an N-point FFT, the products Xm(k)·H(k) are formed, and an N-point IFFT produces the output]
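The O(log N) technique is frequency-domain filtering: transform blocks of the signal and the filter with an FFT, multiply pointwise, and inverse-transform. A minimal self-contained sketch (the recursive radix-2 FFT below is written out for illustration; real code would use an optimized library), checked against direct convolution:

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    sign = 1 if invert else -1
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_filter(h, x):
    """Filter x with FIR taps h by pointwise multiplication in the
    frequency domain: O(log N) work per output value."""
    m = len(x) + len(h) - 1          # length of the linear convolution
    n = 1
    while n < m:                     # pad to the next power of 2
        n *= 2
    H = fft(list(h) + [0.0] * (n - len(h)))
    X = fft(list(x) + [0.0] * (n - len(x)))
    Y = [Hk * Xk for Hk, Xk in zip(H, X)]
    y = fft(Y, invert=True)          # unscaled inverse FFT
    return [v.real / n for v in y[:m]]

def direct_filter(h, x):
    """O(N)-per-output reference: y[k] = sum_i h[i] * x[k - i]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k in range(len(y)):
        for i in range(len(h)):
            if 0 <= k - i < len(x):
                y[k] += h[i] * x[k - i]
    return y

h, x = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_filter(h, x), direct_filter(h, x)))
```

The two paths agree; the FFT version simply trades the per-output O(N) inner loop for a logarithmic number of butterfly stages per block.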
Power Comparison (FIR)
[Chart: power in Watts (0-80) vs. TAPS (N) for CPU, GPU, and FPGA]
Performance-to-Power Ratio (FIR)
[Chart: performance-to-power ratio in megasamples/second/W (0-350) vs. TAPS (N) for CPU, GPU, and FPGA]
FPGAs for Computation
Although the FIR filter is a simple example, it is representative of a large class of applications:
Large amounts of spatial locality
Computation can be expressed as a feed-forward pipeline
The fine-grained parallelism of the FPGA can be used to create custom "processors" that are orders of magnitude more efficient than CPUs or GPUs
A system may have many other types of "processors" which may be customized in a similar manner
CHALLENGES
Software Programmer's View
Programmers are used to software-like environments:
Ideas can easily be expressed in languages such as 'C'
Typically start with a simple sequential program
Use parallel APIs / language extensions to exploit multi-core for additional performance
Compilation times are almost instantaneous
Immediate feedback and rich debugging tools
[Diagram: a main(…) { for( … ) { … } } program passing through a compiler]
FPGA Hardware Design
[Diagrams: state machines (Idle → Request → Ack → Do Useful Work → Done), datapaths, and an SoC interconnect joining bridges, a PCIe core, and a memory controller and PHY across multiple clock domains (125 / 250 / 200 / 400 MHz)]
Design Entry Complexity
These circuits are described in hardware description languages such as VHDL or Verilog
Incredibly detailed design must be done before a first working version is possible:
Cycle-by-cycle behavior must be specified for every register in the design
The complete flexibility of the FPGA means that the designer needs to specify all aspects of the hardware circuit
Buffering, arbitration, IP core interfacing, etc.
FPGA CAD / Compilation is Complex
Sophisticated optimization algorithms are used in each step, leading to significantly longer runtimes than a software compile (hours vs. minutes)
[Flow: Synthesis → Technology Mapping → Clustering → Placement → Routing, mapping logic onto the FPGA's logic blocks]
Timing Closure Problems
Designers will often have to go through numerous
iterations to meet timing requirements
[Flow: HDL → Synthesis, Tech Map → Cluster → Place, Route → Timing Analysis; if timing is not met, a design change feeds back to the start]
Design Scalability
Using RTL design entry, there is significant work in porting applications from generation to generation of FPGA technology
[Diagram: today's FPGA with a few processor + memory pairs vs. a future FPGA with ever-increasing logic capacity and many more]
Ideally, a 2x improvement in logic capacity should translate into 2x performance
In addition to doubling the datapath, control logic and SoC interconnect need to change as well
Future Proofing
To reap the benefits of Moore's law, designs must be described in such a way that they can take advantage of larger logic capacities
HDL design descriptions make this difficult:
Generally tweaked and optimized for a particular device rather than parameterized for the general case
Hard to express abstractions which allow for general parallel scaling
Cycle-by-cycle behavior may need to be altered slightly as cores scale
E.g., interconnection logic may need additional pipeline registers to sustain throughput requirements
Fundamental challenges
Implementing an algorithm on an FPGA is done by designing hardware:
Difficult to design, verify, and code for scalable performance
Generally, software programmers will have difficulty using FPGAs as massive multi-core devices to accelerate parallel applications
We need a programming model that allows the designer to think about the FPGA as a configurable multi-core device
An ideal programming environment …
Has the following characteristics:
Based on a standard multicore programming model rather than something which is FPGA-specific
Abstracts away the underlying details of the hardware
VHDL / Verilog are similar to "assembly language" programming: useful in rare circumstances where the highest possible efficiency is needed
The price of abstraction is not too high
Still need to efficiently use the FPGA's resources to achieve high throughput / low area
Allows for software-like compilation & debug cycles
Faster compile times
Profiling & user feedback
OPENCL: BRIEF INTRODUCTION
What is OpenCL?
OpenCL is a programming model developed by the Khronos Group to support silicon acceleration
An industry consortium creating open API standards
Enables software to leverage silicon acceleration
Commitment to royalty-free standards: making money from enabled products, not from the standards themselves
OpenCL Driving Forces
Attempt at driving an industry standard parallel language across platforms
CPUs, GPUs, Cell Processors, DSPs, and even FPGAs
So far driven by applications from:
The consumer space
Image Processing & Video Encoding
1080p video processing on mobile devices
Augmented reality & Computational Photography
Game programming
More sophisticated rendering algorithms
Scientific / High Performance Computing
Financial, Molecular Dynamics, Bioinformatics, etc.
OpenCL
OpenCL is a parallel language that provides us with two distinct advantages
Parallelism is declared by the programmer
Data parallelism is expressed through the notion of parallel threads which are instances of computational kernels
Task parallelism is accomplished with the use of queues and events that allow us to coordinate the coarse-grained control flow
Data storage and movement is explicit
Hierarchical Memory model
Registers
Accelerator Local Memory
Global off-chip memory
It is up to the programmer to manage their memories and bandwidth efficiently
OpenCL Structure
Natural separation between the code that runs on accelerators* and the code that manages those accelerators
The management or "host" code is pure software that can be executed on any sort of conventional microprocessor
Soft processor, embedded hard processor, external x86 processor
The kernel code is 'C' with a minimal set of extensions that allows for the specification of parallelism and memory hierarchy
Likely only a small fraction of the total code in the application
Used only for the most computationally intensive portions
* Accelerator = processor + memory combo
OpenCL Host Program
Pure software written in standard 'C'
Communicates with the accelerator device via a set of library routines which abstract the communication between the host processor and the kernels
main() {
    read_data_from_file( … );
    manipulate_data( … );

    clEnqueueWriteBuffer( … );        // copy data from host to FPGA
    clEnqueueTask( …, my_kernel, … ); // ask the FPGA to run a particular kernel
    clEnqueueReadBuffer( … );         // copy data from FPGA to host

    display_result_to_user( … );
}
OpenCL Kernels
A data-parallel function that defines many parallel threads of execution
Each thread has an identifier specified by get_global_id
Contains keyword extensions to specify parallelism and memory hierarchy
Executed by a compute object: CPU, GPU, or accelerator

__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *answer)
{
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
}

Example: a = {0, 1, 2, 3, 4, 5, 6, 7}, b = {7, 6, 5, 4, 3, 2, 1, 0},
sum( … ) yields answer = {7, 7, 7, 7, 7, 7, 7, 7}
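The kernel's semantics can be mimicked in plain Python to show what the runtime does conceptually: launch one logical thread per element, each identified by its global ID. This is only a sequential emulation of the data-parallel model, not how an OpenCL device actually executes.

```python
def sum_kernel(a, b, answer, xid):
    # Body of the OpenCL kernel above; xid plays the role of get_global_id(0)
    answer[xid] = a[xid] + b[xid]

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [7, 6, 5, 4, 3, 2, 1, 0]
answer = [0] * 8

# The runtime conceptually launches one thread per global ID;
# here we simply iterate, since every thread is independent.
for xid in range(len(a)):
    sum_kernel(a, b, answer, xid)

print(answer)  # [7, 7, 7, 7, 7, 7, 7, 7]
```

Because no thread reads another thread's output, the iteration order is irrelevant, which is exactly what lets the device run the threads in parallel.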
Altera-supported system configurations
External host: a host CPU runs the user program and OpenCL runtime, with the accelerators on the FPGA
Embedded: an embedded CPU on the FPGA runs the user program and OpenCL runtime alongside the accelerators
OpenCL FPGA Target
[Diagram: kernels become computation datapath accelerators, each with local memory; the host program runs on an embedded soft or hard processor or over PCIe; SOPC / QSYS connects the accelerators with an external memory controller & PHY (DDR* interface IP) and application-specific external protocols]
OpenCL to FPGA Challenges
OpenCL's compute model targets an "abstract machine" that is not an FPGA:
A hierarchical array of processing elements
A corresponding hierarchical memory structure
It is more difficult to target an OpenCL program to an FPGA than to more natural hardware platforms such as CPUs and GPUs
These problems are research opportunities
Compilation Flow
[Flow diagram:
Kernel path: vectorAdd_kernel.cl → CLANG front end → unoptimized LLVM IR → Optimizer → optimized LLVM IR → RTL generator → Verilog → QSYS (system description) → Quartus
Host path: vectorAdd_host.c → C compiler + ACL runtime library → program.elf
The LLVM IR stages are open to third-party or academic tools]
OpenCL to FPGA Advantages
Standard multi-core programming model:
OpenCL is supported by a consortium of companies trying to drive a portable parallel programming model
Abstracts away underlying HW details:
OpenCL is based on standard 'C' with a few extensions
Cycle-by-cycle behavior does not need to be specified
Eases coding of scalable solutions:
Application parallelism is specified by the programmer
Compiler & runtime routines distribute the workloads depending on the characteristics of the accelerator
OpenCL to FPGA Advantages
Addresses long compile times:
Timing closure can be handled by the compilation flow
Registers can be inserted to shorten critical paths, and control logic is automatically adjusted
A large class of design changes can be handled instantaneously
The OpenCL "host" program is pure software that runs on a standard microprocessor
This code is compiled using a standard 'C' compiler: gcc, msvc++, etc.
Summary
OpenCL is a standard multi-core programming
model that can be used to provide a higher-level
layer of abstraction for FPGAs
Research challenges abound:
Need to collaborate with academics, third parties, and members of the Khronos Group
Require libraries, kernel compilers, debugging tools, pre-defined templates, etc.
We have to consider that our "competition" is no longer just other FPGA vendors
A broad spectrum of programmable multi-core devices targeting different market segments
© 2011 Altera Corporation—Confidential
ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation
and registered in the United States and are trademarks or registered trademarks in other countries.
Thank You