Transcript of "Power Efficient Solutions w/ FPGAs" (Altera)

Page 1:

Page 2:

Power Efficient Solutions w/ FPGAs

Bill Jenkins
Altera Sr. Product Specialist for Programming Language Solutions

Page 3:

System Challenges

Market Reaction: Growth of customized hardware and architectures…

[Diagram: CPU surrounded by I/O and memory, with bottlenecks on every path]
- CPU architecture is inefficient for most parallel computing applications (big data, search)
- I/O and memory bottlenecks are starving the CPU for data
- Result: slow performance (high latency)
- Result: excessive power consumption

Page 4:

Role of FPGA

Resource Sharing
– Virtualization of computation, storage, networking

Accelerators
– Network acceleration, hypervisor offload
– Data access acceleration
– Algorithm acceleration

Cluster Computing
– CPU and FPGA
– Cluster fabric
– Cluster interconnect

[Diagram: host CPU with DRAM connected to an FPGA]

Page 5:

FPGAs Increase Efficiency in the Data Center

FPGAs can greatly enhance CPU-based data center processing by accelerating algorithms and minimizing bottlenecks

Massively parallel architecture
– Has 10 to 100 times the number of computational units
– Enables pipelined designs that perform multiple / different instructions in a single clock cycle

Better localized memory avoids bottlenecks

Programmability enables application-specific accelerators

10X+ increase in performance per watt

[Device highlights: >5M logic elements, 1.5 TFLOPs floating-point DSP, programmable I/O, 3200 Mbps DDR4 SDRAM / 2.5 Tbps HMC]

Page 6:

Mapping a simple program to an FPGA

High-level code:
Mem[100] += 42 * Mem[101]

CPU instructions:
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]

Page 7:

First let's take a look at execution on a simple CPU

[Diagram: a simple CPU datapath — Instruction Fetch (PC, Op, Val), Registers (Aaddr, Baddr, Caddr, CData, CWriteEnable), ALU (A, B, C, Op), and Load/Store unit (LdAddr, StAddr, LdData, StData)]

Fixed and general architecture:
- General "cover-all-cases" data-paths
- Fixed data-widths
- Fixed operations

Page 8:

Load constant value into register

[Diagram: the same CPU datapath with only the constant-load path active]

Very inefficient use of hardware!

Page 9:

CPU activity, step by step

[Diagram: one ALU operation executes per time step (Time →)]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]

Page 10:

On the FPGA we unroll the CPU hardware…

[Diagram: one copy of the datapath per instruction, laid out in space (Space →)]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]

Page 11:

… and specialize by position

[Diagram: the unrolled datapath instances from Page 10, one per instruction]

1. Instructions are fixed. Remove "Fetch"

Page 12:

… and specialize

[Diagram: the unrolled datapath instances, further simplified]

1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops

Page 13:

… and specialize

[Diagram: the unrolled datapath instances, further simplified]

1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops
3. Remove unused Load / Store

Page 14:

… and specialize

[Diagram: the datapath instances with registers wired directly between stages]

1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.

Page 15:

… and specialize

[Diagram: the datapath instances with dead data removed]

1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.

Page 16:

… and specialize

[Diagram: the fully specialized, rescheduled datapath]

1. Instructions are fixed. Remove "Fetch"
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!

Page 17:

Custom Data-Path on the FPGA Matches Your Algorithm!

High-level code:
Mem[100] += 42 * Mem[101]

[Diagram: custom data-path — two loads, a multiply by the constant 42, an add, and a store]

Build exactly what you need:
- Operations
- Data widths
- Memory size & configuration

Efficiency:
- Throughput / Latency / Power
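As a rough illustration of how this flow is expressed, a single work-item OpenCL kernel for the statement above might look like the sketch below (the kernel and argument names are hypothetical, not from the slides); the compiler turns its two loads, the multiply by 42, the add, and the store into exactly this kind of fixed data-path.

__kernel void mac42(__global float * restrict mem)
{
    // Two loads, one multiply by a constant, one add, one store:
    // each becomes a dedicated hardware operator on the FPGA.
    mem[100] += 42.0f * mem[101];
}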

Page 18:

Architectural Example: Image Processing

Convolutions: dataflow can proceed in pipelined fashion
– No need to wait until the entire execution is complete
– Start a new set of data calculations as soon as the first stage completes its execution

I_{new}(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_{old}(x + x', y + y') \times F(x', y')

Page 19:

Processor (CPU/GPU) Implementation

A cache can hide poor memory access patterns

[Diagram: CPU with a cache in front of main memory]

for (int y = 1; y < height-1; ++y) {
  for (int x = 1; x < width-1; ++x) {
    for (int y2 = -1; y2 <= 1; ++y2) {
      for (int x2 = -1; x2 <= 1; ++x2) {
        i2[y][x] += i[y+y2][x+x2] * filter[y2][x2];
      }
    }
  }
}

Page 20:

FPGA Implementation

Example performance point: 1 pixel per cycle

Cache requirements: 9 reads + 1 write per cycle

Expensive hardware!
– Power overhead
– Cost overhead: more built-in addressing flexibility than we need

Why not customize the cache for the application?

[Diagram: custom data-path fed by a cache with 9 read ports in front of memory]

Page 21:

Optimizing the "Cache"

Start out with the initial picture that is W pixels wide

[Diagram: the full image, W pixels wide]

Page 22:

Optimizing the "Cache"

Let's remove all the lines that aren't in the neighborhood of the window

[Diagram: only the image lines around the 3x3 window are kept]

Page 23:

Optimizing the "Cache"

Take all of the lines and arrange them as a 1D array of pixels

[Diagram: the retained lines, each W pixels wide, concatenated into a single 1D array]

Page 24:

Optimizing the "Cache"

Remove the pixels at the edges that we don't need for the computation

[Diagram: the 1D array trimmed to the pixels the window actually touches]

Page 25:

Optimizing the "Cache"

What happens when we move the window one pixel to the right?

We have created a shift register implementation

[Diagram: the 1D pixel buffer shifting by one position as the window advances]

Page 26:

Shift Registers in Software

pixel_t sr[2*W+3];
while (keep_going) {
    // Shift data in: move every element one slot toward the higher
    // indices, then insert the new pixel at sr[0]
    #pragma unroll
    for (int i = 2*W+2; i > 0; --i)
        sr[i] = sr[i-1];
    sr[0] = data_in;

    // Tap output data: the 3x3 window, one image line apart
    pixel_t data_out[9] = {sr[0],   sr[1],     sr[2],
                           sr[W],   sr[W+1],   sr[W+2],
                           sr[2*W], sr[2*W+1], sr[2*W+2]};
    // ...
}

[Diagram: data_in entering at sr[0], the 9-element data_out window tapped from the buffer, sr[2*W+2] at the far end]

Managing data movement to match the FPGA's architectural strengths is key to obtaining high performance

Page 27:

Traditional OpenCL Implementation of a Pipeline (CPU/GPU)

High-latency: requires access to global memory

High memory bandwidth

Requires host coordination to pass buffers from one kernel to another

With a particular design example we achieved 183 images/s on a Stratix V PCIe card

[Diagram: Kernel 1 → Kernel 2 → Kernel 3, each stage reading and writing buffers in global memory (DDR)]
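For contrast, a minimal sketch of what this buffer-coupled style looks like on the device side (kernel names, arguments, and the placeholder arithmetic are illustrative, not taken from the design above); each stage round-trips its data through global memory, and the host has to enqueue the stages in order and hand the intermediate buffer from one to the next.

// Stage 1 writes its result to a buffer in global memory (DDR) ...
__kernel void stage1(__global const float * restrict in,
                     __global float * restrict tmp)
{
    int i = get_global_id(0);
    tmp[i] = 2.0f * in[i];        // placeholder computation
}

// ... and stage 2 reads that buffer back out of global memory.
// The host must wait for stage1 to finish, then enqueue stage2 with tmp.
__kernel void stage2(__global const float * restrict tmp,
                     __global float * restrict out)
{
    int i = get_global_id(0);
    out[i] = tmp[i] + 1.0f;       // placeholder computation
}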

Page 28:

Leveraging Kernel-to-Kernel Channels

Low-latency communication between kernels

Significantly lower memory bandwidth requirements

Host is not involved in coordinating communication between kernels

This implementation on the same Stratix V PCIe card resulted in 400 images/s

[Diagram: Kernel 1 → Kernel 2 → Kernel 3 connected by channels, with global memory (DDR) buffers only at the input and output]

• Channel declaration — create a queue:
channel value_type name;

• Channel write — push data into the queue:
void write_channel_altera(channel ch, value_type data);

• Channel read — pop the first element from the queue:
value_type read_channel_altera(channel ch);

Example:
channel int my_channel;
write_channel_altera(my_channel, x);
int y = read_channel_altera(my_channel);
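Putting those pieces together, a minimal sketch of two kernels connected by a channel might look like this (the kernel names, the int payload, and the loop bound N are illustrative, not from the design above):

#pragma OPENCL EXTENSION cl_altera_channels : enable

channel int c0;    // on-chip FIFO connecting producer to consumer

__kernel void producer(__global const int * restrict in, int N)
{
    for (int i = 0; i < N; ++i)
        write_channel_altera(c0, in[i]);    // push into the channel
}

__kernel void consumer(__global int * restrict out, int N)
{
    for (int i = 0; i < N; ++i)
        out[i] = read_channel_altera(c0);   // pop from the channel
}

Both kernels are enqueued once and run concurrently; the data flows through the on-chip FIFO instead of through DDR, which is where the bandwidth and latency savings come from.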

Page 29:

FPGA Code

Kernels are written as standard building blocks that are connected together through channels

The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs
– Offered as a vendor extension
– Portable in OpenCL 2.0 through the concept of "OpenCL Pipes"

#pragma OPENCL EXTENSION cl_altera_channels : enable

// Declaration of Channel API data types
channel float prod_k1_channel;
channel float k1_k2_channel;
channel float k2_k3_channel;
channel float k3_res_channel;

__kernel void convolution_prod(int batch_id_begin,
                               int batch_id_end,
                               __global const volatile float * restrict input_global)
{
    for (...) {
        write_channel_altera(prod_k1_channel, input_global[...]);
        write_channel_altera(k1_k2_channel, input_global[...]);
        ...
    }
}
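The downstream stages follow the same pattern. A hedged sketch of what one of them might look like is below — the slides only show the producer, so the kernel name, loop bounds, and placeholder arithmetic here are assumptions:

// Hypothetical downstream stage: pulls data from an upstream channel,
// does some work, and forwards the result to the next stage's channel.
__kernel void convolution_k2(int batch_id_begin, int batch_id_end)
{
    for (int i = batch_id_begin; i < batch_id_end; ++i) {
        float v = read_channel_altera(k1_k2_channel);
        write_channel_altera(k2_k3_channel, 2.0f * v);   // placeholder computation
    }
}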

Page 30:

Migration Between FPGAs

In OpenCL, a float uses soft logic on older FPGAs

Gen 10 FPGAs have hardened floating-point logic built into the DSP blocks

On Arria 10, the same code results in processing 6800 images/s

Stratix 10 expectations:
– Large increase in floating point resources
– Higher internal frequencies achievable
– 1.6x-2x performance increase
– 12x-16x performance/watt efficiency versus Stratix V

Page 31:

Additional Improvements: IO Channels

Kernel channels are between OpenCL kernels

IO channels take data directly from and to IO interfaces in the FPGA
– A camera or video feed could be processed directly in the FPGA without going through the host
– The result could be passed out to the graphics card to be displayed, or back to host memory for the host to use

Private, local and global memory can now be used to buffer as needed

[Diagram: IO channels feeding Kernel 1 → Kernel 2 → Kernel 3, which are linked by kernel channels/pipes, all inside the FPGA]
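In the Altera SDK an IO channel is declared like a kernel channel but bound to a board interface through an io attribute. A minimal sketch, assuming a board support package that exposes a streaming input named "kernel_input_ch0" (the interface name and the kernel are illustrative):

#pragma OPENCL EXTENSION cl_altera_channels : enable

// Channel tied to a board I/O interface (e.g., a camera or network
// stream). The interface name comes from the board vendor's BSP;
// "kernel_input_ch0" here is only an assumed example.
channel float video_in __attribute__((io("kernel_input_ch0")));

__kernel void grab_frame(__global float * restrict frame, int num_pixels)
{
    for (int i = 0; i < num_pixels; ++i)
        frame[i] = read_channel_altera(video_in);   // stream data straight from I/O
}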

Page 32:

Lessons Learned

Exploiting pipelining on the FPGA requires some attention to coding style to overcome the inherent assumptions of writing "software"
– FPGAs do not have caches
– Need to exploit data reuse in a more explicit way

The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory
– Bandwidth limitations begin to dominate compute
– Use direct kernel-to-kernel communication called channels

Native support for floating point on the FPGA allows an order-of-magnitude performance increase

Code can be ported to newer FPGAs without modification to get a performance increase

IO Channels can lower latency and improve performance even further by taking the host out of the processing chain