Higher Level Programming Abstractions for FPGAs

Transcript of Higher Level Programming Abstractions for FPGAs

Page 1: Higher Level Programming Abstractions for FPGAs

© 2011 Altera Corporation

Higher Level Programming

Abstractions for FPGAs using OpenCL

Desh Singh

Supervising Principal Engineer

Altera Corporation

Toronto Technology Center

Page 2: Higher Level Programming Abstractions for FPGAs


Programmable Solutions: 1985-2002

[Diagram: CPUs (single cores), DSPs, and FPGAs (fine-grained arrays)]

Technology scaling favors programmability

Page 3: Higher Level Programming Abstractions for FPGAs


Programmable Solutions: 2002-20XX

[Diagram: CPUs and DSPs moving from single cores to multi-cores, coarse-grained massively parallel processor arrays, and FPGAs as fine-grained massively parallel arrays]

Technology scaling favors programmability and parallelism

Page 4: Higher Level Programming Abstractions for FPGAs


Programmable and Sequential

[Diagram: CPU executing a program as a stream of instructions]

Reaching the limit: after four decades of success…

Page 5: Higher Level Programming Abstractions for FPGAs


“The End of Denial Architecture”

- William J. Dally [DAC’2009]

Single thread processors are in denial about parallelism and locality

They provide two illusions:

Serial execution: denies parallelism. Tries to exploit parallelism with ILP, which has limited scalability.

Flat memory: denies locality. Tries to provide the illusion with caches, which are very inefficient when the working set doesn't fit in the cache.

Page 6: Higher Level Programming Abstractions for FPGAs


Programmable and Parallel

[Diagram: multiple processor+memory pairs sharing external memory]

Exploit parallelism on a chip
Take advantage of Moore's law
Processors are not getting faster, just wider
Keep the power consumption down
Use more transistors for information processing

Page 7: Higher Level Programming Abstractions for FPGAs


FPGA: Ultimately Configurable Multicore

[Diagram: array of processor+memory pairs]

Many coarse-grained processors, each with local memory, with different implementation options: a small soft scalar processor, a larger vector processor, or a customized hardware pipeline.

Each processor can exploit the fine-grained parallelism of the FPGA to more efficiently implement its "program".

Possibly heterogeneous: optimized for different tasks, and customizable to suit the needs of a particular application.

Page 8: Higher Level Programming Abstractions for FPGAs


Processor Possibilities

[Diagram: the spectrum of implementation options:
Traditional μProcessor;
Scalar soft processor + accelerator (Nios II with program memory, data memories, arbiters, and a C2H accelerator);
VLIW / Vector / TTA soft processor (integer and float ALUs, integer/float/Boolean register files, load/store units, immediate unit, instruction unit, instruction and data memories);
Custom pipeline of dedicated RTL circuitry]

Page 9: Higher Level Programming Abstractions for FPGAs


MOTIVATING CASE STUDY: FINITE IMPULSE RESPONSE (FIR) FILTER


Page 10: Higher Level Programming Abstractions for FPGAs


FIR Filter Example

[Diagram: FIR filter with TAPS (N = 7)]

Page 11: Higher Level Programming Abstractions for FPGAs


Custom Multithreaded Pipeline

A throughput of 1 thread per cycle is possible using a direct HW implementation. The FPGA offers custom pipeline parallelism which can be perfectly tailored to the FIR filter.

[Diagram: threads 0-4 entering the custom pipeline on successive clock cycles, one new thread per cycle, each computing one output:
y[0] = Σ_i h(i)·x(0-i)
y[1] = Σ_i h(i)·x(1-i)
y[2] = Σ_i h(i)·x(2-i)
y[3] = Σ_i h(i)·x(3-i)
y[4] = Σ_i h(i)·x(4-i)]

Page 12: Higher Level Programming Abstractions for FPGAs


Absolute Performance Comparison (FIR)

[Chart: performance (megasamples/second) vs. TAPS (N) for CPU, GPU, and FPGA]

*Filter too large to fit without serialization

Page 13: Higher Level Programming Abstractions for FPGAs


Performance on Conventional Processors

A straightforward loop implementation requires O(N) operations for every value produced. This is costly, as N can be much larger than the number of parallel functional units present on most processors.

A technique that offers higher throughput uses signal processing theory and requires only O(log N) operations per value.

[Diagram: fast convolution: x(n) and h(n) each pass through an N-point FFT, the products Xm(k)·H(k) are formed, and an N-point IFFT produces the output]

Page 14: Higher Level Programming Abstractions for FPGAs


Power Comparison (FIR)

[Chart: power (Watts) vs. TAPS (N) for CPU, GPU, and FPGA]

Page 15: Higher Level Programming Abstractions for FPGAs


Performance-to-Power Ratio (FIR)

[Chart: performance-to-power ratio (megasamples/second/Watt) vs. TAPS (N) for CPU, GPU, and FPGA]

Page 16: Higher Level Programming Abstractions for FPGAs


FPGAs for Computation

Although the FIR filter is a simple example, it is representative of a large class of applications: large amounts of spatial locality, and computation that can be expressed as a feed-forward pipeline.

The fine-grained parallelism of the FPGA can be used to create custom "processors" that are orders of magnitude more efficient than CPUs or GPUs.

A system may have many other types of "processors" which may be customized in a similar manner.


Page 17: Higher Level Programming Abstractions for FPGAs


CHALLENGES


Page 18: Higher Level Programming Abstractions for FPGAs


Software Programmer's View

Programmers are used to software-like environments:
Ideas can easily be expressed in languages such as 'C'
Typically start with a simple sequential program
Use parallel APIs / language extensions to exploit multi-core for additional performance
Compilation times are almost instantaneous
Immediate feedback and rich debugging tools


[Diagram: several main() { for( … ) { … } } source files feeding a compiler]

Page 19: Higher Level Programming Abstractions for FPGAs


FPGA Hardware Design


[Diagram: state machines (Idle → Request/Ack → Do Useful Work → Done), datapaths, and an SoC interconnect spanning multiple clock domains (125 MHz, 250 MHz bridge; 200 MHz, 400 MHz PCIe core; memory controller and PHY)]

Page 20: Higher Level Programming Abstractions for FPGAs


Design Entry Complexity

The description of these circuits is done through hardware design languages such as VHDL or Verilog.

Incredibly detailed design must be done before a first working version is possible:
Cycle-by-cycle behavior must be specified for every register in the design
The complete flexibility of the FPGA means that the designer needs to specify all aspects of the hardware circuit: buffering, arbitration, IP core interfacing, etc.


Page 21: Higher Level Programming Abstractions for FPGAs


FPGA CAD / Compilation is Complex

Sophisticated optimization algorithms are used in each step and lead to significantly longer runtimes than a software compile (hours vs. minutes).

[Diagram: CAD flow: synthesis → technology mapping → clustering → placement → routing, mapping the netlist onto logic blocks]

Page 22: Higher Level Programming Abstractions for FPGAs


Timing Closure Problems

Designers will often have to go through numerous iterations to meet timing requirements.

[Diagram: iteration loop: HDL → synthesis & tech map → clustering → place & route → timing analysis; when timing is not met, a design change feeds back to the HDL]

Page 23: Higher Level Programming Abstractions for FPGAs


Design Scalability

Using RTL design entry, there is significant work in porting applications from generation to generation of FPGA technology.

[Diagram: today's FPGA with an array of processor+memory pairs vs. a future device with ever-increasing logic capacity holding twice as many]

Ideally, a 2x improvement in logic capacity should translate into 2x performance. In addition to doubling the datapath, the control logic and SoC interconnect need to change as well.

Page 24: Higher Level Programming Abstractions for FPGAs


Future Proofing

To reap the benefits of Moore's law, designs must be described in such a way that they can take advantage of larger logic capacities.

The HDL design description makes this difficult:
Designs are generally tweaked and optimized for a particular device rather than parameterized for the general case
It is hard to express abstractions which allow for general parallel scaling
Cycle-by-cycle behavior may need to be altered slightly as cores scale, e.g. interconnection logic may need additional pipeline registers to sustain throughput requirements

Page 25: Higher Level Programming Abstractions for FPGAs


Fundamental challenges

Implementing an algorithm on an FPGA is done by designing hardware, which is difficult to design, verify, and code for scalable performance.

Generally, software programmers will have difficulty using FPGAs as massive multi-core devices to accelerate parallel applications.

We need a programming model that allows the designer to think about the FPGA as a configurable multi-core device.

Page 26: Higher Level Programming Abstractions for FPGAs


An ideal programming environment …

Has the following characteristics:

Based on a standard multicore programming model rather than something which is FPGA-specific

Abstracts away the underlying details of the hardware
(VHDL / Verilog are similar to "assembly language" programming: useful in rare circumstances where the highest possible efficiency is needed)

The price of abstraction is not too high
(we still need to use the FPGA's resources efficiently to achieve high throughput / low area)

Allows for software-like compilation & debug cycles
(faster compile times, profiling & user feedback)

Page 27: Higher Level Programming Abstractions for FPGAs


OPENCL: BRIEF INTRODUCTION

Page 28: Higher Level Programming Abstractions for FPGAs


What is OpenCL?

OpenCL is a programming model developed by the Khronos group, an industry consortium creating open API standards that enable software to leverage silicon acceleration. The group is committed to royalty-free standards: members make money from enabled products, not from the standards themselves.

Page 29: Higher Level Programming Abstractions for FPGAs


OpenCL Driving Forces

An attempt at driving an industry-standard parallel language across platforms: CPUs, GPUs, Cell processors, DSPs, and even FPGAs.

So far driven by applications from:

The consumer space: image processing & video encoding (1080p video processing on mobile devices), augmented reality & computational photography

Game programming: more sophisticated rendering algorithms

Scientific / high-performance computing: financial, molecular dynamics, bioinformatics, etc.

Page 30: Higher Level Programming Abstractions for FPGAs


OpenCL

OpenCL is a parallel language that provides two distinct advantages.

Parallelism is declared by the programmer:
Data parallelism is expressed through the notion of parallel threads, which are instances of computational kernels
Task parallelism is accomplished with queues and events that allow us to coordinate the coarse-grained control flow

Data storage and movement is explicit:
Hierarchical memory model: registers, accelerator local memory, global off-chip memory
It is up to the programmer to manage memories and bandwidth efficiently

Page 31: Higher Level Programming Abstractions for FPGAs


OpenCL Structure

There is a natural separation between the code that runs on accelerators* and the code that manages those accelerators.

The management or "host" code is pure software that can be executed on any sort of conventional microprocessor: a soft processor, an embedded hard processor, or an external x86 processor.

The kernel code is 'C' with a minimal set of extensions that allow for the specification of parallelism and memory hierarchy. It is likely only a small fraction of the total code in the application, used only for the most computationally intensive portions.

* Accelerator = processor + memory combo

Page 32: Higher Level Programming Abstractions for FPGAs


OpenCL Host Program

The host program is pure software written in standard 'C'. It communicates with the accelerator device via a set of library routines which abstract the communication between the host processor and the kernels.

main() {
    read_data_from_file( … );
    manipulate_data( … );

    clEnqueueWriteBuffer( … );        /* copy data from host to FPGA */
    clEnqueueTask( …, my_kernel, … ); /* ask the FPGA to run a particular kernel */
    clEnqueueReadBuffer( … );         /* copy data from FPGA to host */

    display_result_to_user( … );
}

Page 33: Higher Level Programming Abstractions for FPGAs


OpenCL Kernels

A kernel is a data-parallel function:
Defines many parallel threads of execution
Each thread has an identifier specified by get_global_id
Contains keyword extensions to specify parallelism and memory hierarchy
Executed by a compute object: CPU, GPU, or accelerator

__kernel void sum(__global const float *a,
                  __global const float *b,
                  __global float *answer)
{
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
}

__kernel void sum( … );

float *a      = { 0, 1, 2, 3, 4, 5, 6, 7 }
float *b      = { 7, 6, 5, 4, 3, 2, 1, 0 }
float *answer = { 7, 7, 7, 7, 7, 7, 7, 7 }

Page 34: Higher Level Programming Abstractions for FPGAs

Altera-supported system configurations

[Diagram: two configurations. External host: a host CPU runs the user program and OpenCL runtime, attached to an FPGA containing the accelerator. Embedded: an embedded CPU on the FPGA runs the user program and OpenCL runtime alongside the accelerator.]

Page 35: Higher Level Programming Abstractions for FPGAs


OpenCL FPGA Target


[Diagram: kernels map to an array of accelerator / datapath computation units (each a processor with local memory), assembled via SOPC / QSYS together with an embedded soft or hard processor, PCIe, an external memory controller & PHY with DDR* interface IP, and application-specific external protocols; the host program runs on the processor]

Page 36: Higher Level Programming Abstractions for FPGAs


OpenCL to FPGA Challenges

OpenCL's compute model targets an "abstract machine" that is not an FPGA: a hierarchical array of processing elements with a corresponding hierarchical memory structure.

It is therefore more difficult to target an OpenCL program to an FPGA than to more natural hardware platforms such as CPUs and GPUs. These problems are research opportunities.

Page 37: Higher Level Programming Abstractions for FPGAs


Compilation Flow


[Diagram: compilation flow. vectorAdd_host.c is built by a C compiler against the ACL runtime library into program.elf. vectorAdd_kernel.cl passes through the CLANG front end to unoptimized LLVM IR, through the optimizer to optimized LLVM IR, and through the RTL generator to Verilog; QSYS and Quartus then build the system from the Verilog and a system description. Third-party or academic tools can hook into the LLVM IR stages.]

Page 38: Higher Level Programming Abstractions for FPGAs


OpenCL to FPGA Advantages

Standard multi-core programming model:
OpenCL is supported by a consortium of companies trying to drive a portable parallel programming model

Abstracts away underlying HW details:
OpenCL is based on standard 'C' with a few extensions; cycle-by-cycle behavior does not need to be specified

Eases coding of scalable solutions:
Application parallelism is specified by the programmer; compiler & runtime routines distribute the workloads depending on the characteristics of the accelerator

Page 39: Higher Level Programming Abstractions for FPGAs


OpenCL to FPGA Advantages

Addresses long compile times:
Timing closure can be handled by the compilation flow; registers can be inserted to shorten critical paths, and control logic is automatically adjusted
A large class of design changes can be handled instantaneously: the OpenCL "host" program is pure software that runs on a standard microprocessor, compiled using a standard 'C' compiler (gcc, msvc++, etc.)

Page 40: Higher Level Programming Abstractions for FPGAs


Summary

OpenCL is a standard multi-core programming model that can be used to provide a higher-level layer of abstraction for FPGAs.

Research challenges abound:
We need to collaborate with academics, third parties, and members of the Khronos group
We require libraries, kernel compilers, debugging tools, pre-defined templates, etc.
Our "competition" is no longer just other FPGA vendors, but a broad spectrum of programmable multi-core devices targeting different market segments

Page 41: Higher Level Programming Abstractions for FPGAs

© 2011 Altera Corporation—Confidential

ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation and registered in the United States and are trademarks or registered trademarks in other countries.

Thank You