Page 1: Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.

Advanced Computer Architecture, CSE 520

Generating FPGA-Accelerated DFT Libraries

Chi-Li Yu

Nov. 13, 2007

Page 2

Overview

Application: 1D/2D Discrete Fourier Transform

Problem: Hardware-Software Partitioning, Acceleration Based on FPGA

Results (compared to software-only solution):

Up to 7.5 times higher performance

Up to 2.5 times better energy efficiency

Page 3

Why DFT?

Discrete Fourier Transform (DFT) is an important primitive underlying many DSP applications.

Imaging/speech processing

Communication systems

Computation-intensive

Data/memory-intensive

Page 4

Review of DFT

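DFT:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi nk/N}$$

Requires N^2 complex multiplies and N(N-1) complex additions

When N is a power-of-two, 2^p (with $W_N = e^{-j 2\pi/N}$):

$$
\begin{aligned}
X[k] &= \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \\
     &= \sum_{n\ \mathrm{even}} x[n]\, W_N^{nk} + \sum_{n\ \mathrm{odd}} x[n]\, W_N^{nk} \\
     &= \sum_{r=0}^{N/2-1} x[2r]\, (W_N^{2})^{rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, (W_N^{2})^{rk} \\
     &= \sum_{r=0}^{N/2-1} x[2r]\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, W_{N/2}^{rk}
\end{aligned}
$$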
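The even/odd split reviewed on this slide can be tried out directly. The following is a minimal Python sketch, not code from the paper: the function names `dft_naive` and `fft_radix2` are made up for illustration, and the minus sign in the exponent is the usual forward-DFT convention, assumed here.

```python
import cmath


def dft_naive(x):
    """Direct O(N^2) evaluation of X[k] = sum_n x[n] * W_N^(n*k)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)  # W_N (sign convention assumed)
    return [sum(x[n] * w ** (n * k) for n in range(n_points))
            for k in range(n_points)]


def fft_radix2(x):
    """Recursive even/odd (decimation-in-time) split; N must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return [complex(x[0])]
    if n_points % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # DFT of x[2r],   size N/2
    odd = fft_radix2(x[1::2])   # DFT of x[2r+1], size N/2
    half = n_points // 2
    out = [0j] * n_points
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n_points) * odd[k]  # W_N^k * odd part
        out[k] = even[k] + t
        out[k + half] = even[k] - t  # uses W_N^(k + N/2) = -W_N^k
    return out
```

Both functions should agree to within rounding error on any power-of-two input; the recursive version replaces the N^2 multiplies of the direct sum with roughly (N/2) log2 N twiddle multiplies.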

Page 5

Pipelined streaming architecture of FFT

Data flow diagram of Fast Fourier Transform (FFT)

Pipelined streaming architecture (Throughput: 1 sample/clock)

Page 6

Problem

Pure hardware implementation:

N should be a power-of-two

N is usually fixed

An arbitrary-sized DFT is hard to implement

Flexible programmability / fast execution time

Hardware-software heterogeneous architecture

HW-SW partitioning

Page 7

Principles of HW-SW partitioning

Hardware: The most computation-intensive kernels that are conducive to hardware acceleration are extracted from an algorithm and realized as hardware.

Software: Remaining computations are carried out in software. Control-intensive part.

Xilinx Virtex-II Pro Platform FPGA

Page 8

Xilinx Virtex-II Pro Platform FPGA

Field Programmable Gate Array: FPGA

Process: 0.13 µm, 1.5 V

Flexible Logic Resources

Up to 1M gate-count capacity

Up to 8 Mb of True Dual-Port RAM

Embedded IBM PowerPC 405 RISC processor blocks

provide performance up to 400 MHz

Page 9

The way to achieve hardware acceleration for DFT

When considering power-of-2 problem sizes (i.e., DFTs on 2^p points), we only need to consider two-power sized DFT kernels (i.e., DFT_{2^q}).

By off-loading the appropriate kernels into hardware, the software receives the benefit of hardware acceleration and yet can still compute arbitrary sized DFTs on top of the available kernels.
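To make the idea of software building larger DFTs on top of a fixed-size kernel concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the paper's generated code: `KERNEL_SIZE`, `hw_kernel_dft`, and `dft_on_kernel` are hypothetical names, the "hardware" kernel is simulated in software, and a plain decimation-in-time split is assumed.

```python
import cmath

KERNEL_SIZE = 8  # hypothetical native size 2^q of the offloaded core


def hw_kernel_dft(block):
    """Stand-in for the hardware core: a DFT of exactly KERNEL_SIZE points.
    Computed in software here purely so the sketch is runnable."""
    if len(block) != KERNEL_SIZE:
        raise ValueError("kernel only accepts its native size")
    w = cmath.exp(-2j * cmath.pi / KERNEL_SIZE)
    return [sum(block[s] * w ** (s * k) for s in range(KERNEL_SIZE))
            for k in range(KERNEL_SIZE)]


def dft_on_kernel(x):
    """DFT of N = m * KERNEL_SIZE points built on top of the fixed kernel."""
    n_points = len(x)
    if n_points % KERNEL_SIZE:
        raise ValueError("length must be a multiple of KERNEL_SIZE")
    m = n_points // KERNEL_SIZE
    # "Hardware" part: one kernel call per subsequence x[r], x[r+m], x[r+2m], ...
    sub = [hw_kernel_dft(x[r::m]) for r in range(m)]
    # "Software" part: twiddle factors W_N^(r*k) and the m-term combination.
    out = []
    for k in range(n_points):
        acc = 0j
        for r in range(m):
            twiddle = cmath.exp(-2j * cmath.pi * r * k / n_points)
            acc += twiddle * sub[r][k % KERNEL_SIZE]
        out.append(acc)
    return out
```

With `KERNEL_SIZE = 8`, calling `dft_on_kernel` on 32 points uses four kernel calls plus software combination, and the same routine handles 24 points (m = 3). The combination loop here is deliberately simple; a real library would presumably use a faster software step.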

Page 10

Research problem

Different kernels in hardware yield:

Different performance (e.g., operations per second)

Different amounts of resources (e.g., logic, number of BRAMs, or power consumption)

DFT partitioning problem: selecting the appropriate set of throughput-optimized two-power sized DFT cores to satisfy a given resource constraint (logic, power, energy) while maximizing a scalar metric, such as performance.
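The selection problem can be pictured as a small constrained search. The sketch below is only an illustration: the area costs and runtimes in `AREA` and `RUNTIME` are invented numbers, total library runtime is an assumed stand-in for the "scalar metric", and exhaustive search is used for clarity rather than because the paper does so.

```python
from itertools import combinations

# Hypothetical area cost (arbitrary units) of each candidate core, by native size.
AREA = {64: 1, 256: 2, 1024: 4}

# Hypothetical runtime (arbitrary units) of each library size when a given
# core is used; the key None means the software-only path.
RUNTIME = {
    64:   {None: 10,  64: 2,   256: 5,  1024: 8},
    256:  {None: 50,  64: 30,  256: 8,  1024: 20},
    1024: {None: 240, 64: 160, 256: 90, 1024: 32},
}


def library_runtime(cores):
    """Total runtime when every library size uses its best available option."""
    return sum(min(per_core[c] for c in (None, *cores))
               for per_core in RUNTIME.values())


def select_cores(area_budget):
    """Exhaustively pick the core subset with the lowest total runtime
    whose summed area stays within the budget."""
    best = ((), library_runtime(()))  # start from the software-only library
    for count in range(1, len(AREA) + 1):
        for cores in combinations(sorted(AREA), count):
            if sum(AREA[c] for c in cores) > area_budget:
                continue
            total = library_runtime(cores)
            if total < best[1]:
                best = (cores, total)
    return best
```

With these made-up numbers, a budget of 3 units selects the 64- and 256-point cores, while a budget of 4 makes the single 1024-point core the better choice, which shows how the preferred set can change with the constraint.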

Page 11

Test platform based on the FPGA

Note that the data cache of the PowerPC is 16 kB.

Page 12

Architecture of the generated hardware DFT IP cores

FPGA

Page 13

DFT Performance (N is a power-of-two)

The highest performance is reached at the core’s native size.

Data does not fit into the data cache at N = 8192.

Memory bandwidth becomes the main bottleneck and practically reduces all possible speedups.

Page 14

DFT Performance (N is not a power-of-two)

N = 3*2^k and N = 5*2^k

Radix-3 and Radix-5 operations are done in software.
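One way to see how a radix-3 step in software could sit on top of power-of-two kernels is the following sketch. It assumes a decimation-in-time split (the slide does not say which decomposition is used). The symbols M = 2^k, so that N = 3M, the output index ℓ, and the sub-transforms Y_r are introduced here for illustration, with $W_N = e^{-j 2\pi/N}$:

$$
Y_r[\ell] = \sum_{s=0}^{M-1} x[3s+r]\, W_M^{s\ell}, \qquad r = 0, 1, 2
$$

$$
X[\ell] = Y_0[\ell \bmod M] + W_N^{\ell}\, Y_1[\ell \bmod M] + W_N^{2\ell}\, Y_2[\ell \bmod M], \qquad \ell = 0, \dots, N-1
$$

The three Y_r are size-2^k DFTs that a hardware core could compute; the twiddle multiplications and three-term sums are the radix-3 work left to software. The N = 5*2^k case has the same shape with five subsequences.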

Page 15

DFT Precision

Page 16

1D DFT with different core sizes

Up to 7.5 times speedup.

The best choice depends on the targeted applications.

For small problem sizes, software is the most energy-efficient choice.

Page 17

2D DFT with different core sizes

Up to 4 times speedup.

Again, for small problem sizes, software is the most energy-efficient choice.

All sizes larger than or equal to 64x128 do not fit into the data cache of the PPC, which leads to a performance degradation.

Page 18

Area/performance

There is also a 3 times variation in the power consumed by the DFT calculations.

In other words, by allowing up to 3 times more power (or 4 times more area) to be consumed, one can speed up a whole library up to 4 times (averaged across the library).
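As a rough check on how these factors relate to energy, note that energy per transform is average power times runtime. The sketch below assumes that the 3 times power figure and the 4 times speedup figure describe the same configuration, and that power is roughly constant over the run; neither assumption is stated on the slide.

$$
\frac{E_{\text{accel}}}{E_{\text{base}}}
= \frac{P_{\text{accel}}\, t_{\text{accel}}}{P_{\text{base}}\, t_{\text{base}}}
= \frac{(3 P_{\text{base}})\,(t_{\text{base}}/4)}{P_{\text{base}}\, t_{\text{base}}}
= \frac{3}{4}
$$

Under those assumptions, the accelerated configuration would use about 25% less energy per transform even though it draws more power.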

Page 19

Power/performance

There is a 4 times variation in both area consumption and normalized runtime across all possible.

Page 20

Conclusions

In the experiments on a Xilinx Virtex-II Pro, the automatically partitioned and generated FPGA-accelerated library has between 2 and 7.5 times higher performance and up to 2.5 times better energy efficiency than the software-only version.

We have integrated this approach in the “Spiral linear-transform code-generation framework” to support push-button automatic implementation.

Page 21

Conclusions

Architectures with tightly integrated FPGAs and general-purpose processors are starting to play an important role in both embedded and high-performance computing settings.

The tight integration makes it possible to offload fine- and coarse-grain functionalities from processors to the FPGA fabric, combining the strengths of both components.

Page 22

My critiques of this paper

Strengths: Detailed analysis of the HW-SW partitioning. Comparisons of performance and energy efficiency are very valuable.

Weaknesses: 2D DFT on this platform is not efficient. Communication between the PPC and the FPGA slows down the whole operation.

Page 23

What is relevant to our class?

A heterogeneous architecture combining two different cores: one RISC CPU and one block of programmable hardware, the FPGA.

Discussions on the power consumption of this kind of platform are interesting.

Page 24

What is relevant to our project?

The same application: Discrete Fourier Transform.

The same platform: Xilinx FPGA

Reduce the workload of the PPC.

Introduce the concept of multi-core architectures to our hardware design.

Page 25

Paper

Paolo D’Alberto, et al., “Generating FPGA-Accelerated DFT Libraries,” in Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'07), pp. 173-184, Napa Valley, CA, USA, April 23-25, 2007.