Digital Signal Processing on Reconfigurable Computing Systems

Oliver Liu

ENGG*6090: Reconfigurable Computing Systems, Winter 2007


References

• Maya B. Gokhale, Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays, Chapter 5.
• C. Maxfield, The Design Warrior’s Guide to FPGAs, Chapter 12.
• Andrew Y. Lin, Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design, Master's Thesis, University of Florida, 2003.
• Ali M. Al-Haj, Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic, Department of Electronics Engineering, Princess Sumaya University for Technology, Al-Jubeiha P.O. Box 1438, Amman 11941, Jordan, 2003.

Introduction

Why use reconfigurable computing for DSP?
• Advantages and disadvantages of RC for DSP.
• Explorations in parallel DSP processing in FPGAs.

Some basic DSP application building blocks
• MAC: multiply-accumulate unit for DSP.
• Bit-serial adder, parallel distributed arithmetic multiplier.
• DSP components.

Some FPGA-centric DSP design tools
• Assembly, C/C++, Handel-C, RTL, Xilinx Core Generator, MATLAB/Simulink, Xilinx System Generator

Advantages and disadvantages of RC for DSP (1)

Technology  Performance  Cost    Power       Flexibility  Memory BW  I/O BW
GPP         Low          Low     High        High         Low        Low
PDSP        Med-High     Medium  Medium      Medium       Medium     Low
ASIC        High         High    Low         Low          High       High
FPGA        Medium       Low     Low-Medium  High         High       High

Advantages and disadvantages of RC for DSP (2)

Advantages
• Parallel processing capability achieves high performance.
• A flexible architecture reduces the risk of product development.
• The design can be changed during the evolution of the product.
• Word widths can be flexible.
• Lower power than a DSP processor.
• Price is becoming lower.

Disadvantages
• Power efficiency and performance are lower than those of an ASIC.

Explorations in parallel DSP processing in Reconfigurable Computing Systems (1)

[Figure: two datapath diagrams, each running from Data In to Data Out with coefficients a(0) to a(3) and registers (Reg0 to Reg3)]

Explorations in parallel DSP processing in Reconfigurable Computing Systems (2)

• Most DSP applications require several operations, such as FIR filters and transforms, to process each incoming data stream. This provides the potential to exploit coarse-grained parallelism in an FPGA.
• DSP applications often use fixed coefficients or constants throughout. By “folding” the constants directly into hardware, i.e., customizing the hardware for a given constant, the area and speed of operations can be significantly improved.
• Reconfigurable computing’s ability to supply both flexible and significant memory bandwidth also increases the parallelism that can be extracted in DSP applications.
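The constant-folding point can be illustrated with a small software analogy. This is only a sketch, assuming a non-negative integer constant; the function names are hypothetical and not from the slides. A multiplication by a fixed coefficient is decomposed into the shifts and adds that dedicated hardware could wire up directly, so no general-purpose multiplier is needed.

```python
def shift_add_plan(constant):
    """Return the bit positions that are set in a non-negative constant.

    Each position p stands for one hard-wired 'x << p' term.
    """
    if constant < 0:
        raise ValueError("this sketch only handles non-negative constants")
    return [p for p in range(constant.bit_length()) if (constant >> p) & 1]


def multiply_by_constant(x, plan):
    """Multiply x by the folded constant using only shifts and adds."""
    return sum(x << p for p in plan)


plan = shift_add_plan(10)                 # 10 = 0b1010: shifts by 1 and by 3
product = multiply_by_constant(7, plan)   # (7 << 1) + (7 << 3)
```

A constant with few set bits needs few adders, which is one reason customizing for a given constant can save area.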

Some DSP Application Building Blocks (1)

The most commonly used DSP functions are:
• FIR (Finite Impulse Response) filters,
• IIR (Infinite Impulse Response) filters,
• FFT (Fast Fourier Transform),
• DCT (Discrete Cosine Transform),
• Encoder/Decoder and Error Correction/Detection functions.

All of these blocks perform intensive arithmetic operations such as add, subtract, multiply, multiply-add, or multiply-accumulate.

[Figure: MAC unit, built from a multiplier, an adder, and an accumulator; inputs A[n:0] and B[n:0], output Y[(2n - 1):0]]

[Figure: Bit-serial adder unit; inputs A and B, output Sum, with a flip-flop (D, Q, Clr, Clk)]
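As a behavioural illustration of the two units in these figures, the sketch below models a multiply-accumulate loop and a bit-serial adder in software. It assumes unsigned operands and that the flip-flop in the adder holds the carry between bit times; the function names are hypothetical.

```python
def mac(pairs):
    """Multiply-accumulate: acc += a * b for each (a, b) pair."""
    acc = 0
    for a, b in pairs:
        acc += a * b
    return acc


def bit_serial_add(a, b, width):
    """Add two unsigned integers one bit per 'clock', least significant bit first.

    The carry variable plays the role of the flip-flop. The sum is limited
    to `width` bits; the final carry is returned separately.
    """
    carry = 0
    total = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        total |= sum_bit << i
    return total, carry
```

The bit-serial form trades time for area: one full adder and one flip-flop handle any word width, at the cost of one clock per bit.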

[Figure: 8-bit by 8-bit Parallel Distributed Arithmetic Multiplier. Input[7:0] is split into Input[7:4] and Input[3:0], each driving Addr[3:0] of a 16x12-bit ROM; the ROM outputs UPP[11:0] and LPP[11:0] are combined by a 12-bit adder, with LPP[3:0] forming the low bits of the result]
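One plausible reading of the figure labels can be sketched in software. The sketch assumes a constant-coefficient multiplier with unsigned 8-bit operands, and that both ROMs hold the same table of coefficient multiples; none of this is confirmed by the slide, and the function names are hypothetical.

```python
def build_rom(coefficient):
    """16-entry ROM: entry n holds coefficient * n, which fits in 12 bits."""
    if not 0 <= coefficient <= 0xFF:
        raise ValueError("coefficient must fit in 8 bits")
    return [coefficient * n for n in range(16)]


def lut_multiply(rom, value):
    """Multiply an 8-bit input by the ROM's coefficient using two lookups."""
    if not 0 <= value <= 0xFF:
        raise ValueError("input must fit in 8 bits")
    upp = rom[(value >> 4) & 0xF]    # upper partial product, from Input[7:4]
    lpp = rom[value & 0xF]           # lower partial product, from Input[3:0]
    upper_sum = upp + (lpp >> 4)     # the 12-bit adder: UPP + LPP[11:4]
    return (upper_sum << 4) | (lpp & 0xF)   # LPP[3:0] passes straight through
```

Because the upper partial product carries a weight of 16, only the top eight bits of LPP overlap with UPP, so LPP[3:0] needs no adder at all.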

• Efficient memory structures (LUTs)
• Filters: IIR, FIR, LMS, etc.
• Fast Fourier Transform (FFT)
• Discrete Cosine Transform (DCT)
• Discrete Wavelet Transform (DWT)

Some FPGA-Centric DSP Design Tools and Languages

• Assembly, C, C++
• VHDL/Verilog (RTL code)
• Xilinx EDK
• Xilinx ISE
• Mentor Graphics ModelSim
• Xilinx Core Generator
• MATLAB/Simulink
• Xilinx System Generator

Topics Covered

• Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design.
• Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic.
• ENGG*6090 Project Status: Image Compression using Wavelet Filter Bank on Reconfigurable Computing System.

Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design

Andrew Y. Lin, Master's Thesis, University of Florida, 2003

Problem Statement and Purpose of the Design

Due to the finite precision of digital hardware, quantization must be performed in any or all of the following areas:

• Input and reference signals;

• Product quantization in the convolution stage;

• Coefficient quantization in the adaptation stage.

Quantization noise is introduced in all of the above areas. The thesis discusses the effects of quantization and also compares the performance of FPGAs and DSP processors in terms of speed and power consumption.

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (1) -- Rounding Effects

• E is the expectation of the rounding error
• X is the error caused by rounding
• q is the quantization step
• P is the probability density function (pdf)
• σ is the power spectral density of the rounding error
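The equations for this slide did not survive in the text. Under the usual textbook assumption (not confirmed by the slide) that the rounding error x is uniformly distributed over one quantization step, the quantities defined above take the following standard form, with σ² written for the error power:

```latex
P(x) =
\begin{cases}
  \dfrac{1}{q}, & -\dfrac{q}{2} \le x \le \dfrac{q}{2} \\[4pt]
  0, & \text{otherwise}
\end{cases}
\qquad
E[x] = \int_{-q/2}^{q/2} x \, P(x) \, dx = 0
\qquad
\sigma^2 = \int_{-q/2}^{q/2} x^2 \, P(x) \, dx = \frac{q^2}{12}
```

Rounding is therefore unbiased under this model: the error has zero mean, and its power depends only on the step size q.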

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (2) -- Truncation Effects

• E is the expectation of the truncation error
• X is the error caused by truncation
• q is the quantization step
• P is the probability density function (pdf)
• σ is the power spectral density of the truncation error
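The equations for this slide did not survive in the text. As a sketch, assume two's complement truncation (which always rounds toward negative infinity) and a truncation error x that is uniformly distributed over one quantization step; neither assumption is confirmed by the slide. The standard results, with σ² written for the variance of the error, are then:

```latex
P(x) =
\begin{cases}
  \dfrac{1}{q}, & -q < x \le 0 \\[4pt]
  0, & \text{otherwise}
\end{cases}
\qquad
E[x] = \int_{-q}^{0} x \, P(x) \, dx = -\frac{q}{2}
\qquad
\sigma^2 = \int_{-q}^{0} \left( x + \frac{q}{2} \right)^2 P(x) \, dx = \frac{q^2}{12}
```

Under this model truncation has the same error variance as a uniform error of width q centred on zero, but it adds a bias of -q/2.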

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (3) -- Rounding Effects on LMS Filter

• Input quantization effects (A/D): ε(nT) is the quantization noise
• Arithmetic rounding effects
• Product rounding effects
• Coefficient rounding effects
• Rounding effects at the adaptation stage
• Rounding effects at the convolution stage
• LMS filter slowdown and stalling
• Saturation: with the clamping technique, upon detecting saturation the result is “clamped” to the most positive or most negative number, depending on the sign bit. Alternatively, the sign algorithm is another way to reduce or avoid stalling.
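The clamping technique and the sign algorithm mentioned above can be sketched together. This is a minimal integer model under stated assumptions, not the thesis's implementation: weights and samples are plain integers with no fractional scaling, the word length defaults to 16 bits, the step size is a right shift, and all names are hypothetical.

```python
def saturate(value, bits=16):
    """Clamp to the most positive or most negative two's complement value."""
    highest = (1 << (bits - 1)) - 1
    lowest = -(1 << (bits - 1))
    return max(lowest, min(highest, value))


def sign(value):
    """Return -1, 0, or +1."""
    return (value > 0) - (value < 0)


def sign_lms_step(weights, taps, desired, shift=4, bits=16):
    """One update of a sign-error LMS filter with saturating arithmetic."""
    output = saturate(sum(w * x for w, x in zip(weights, taps)), bits)
    error = saturate(desired - output, bits)
    # Only the sign of the error is used, so a small error cannot be
    # rounded away to a zero update as long as the shifted tap is non-zero.
    new_weights = [saturate(w + sign(error) * (x >> shift), bits)
                   for w, x in zip(weights, taps)]
    return new_weights, output, error
```

Clamping keeps an overflow from wrapping around to a value of the opposite sign, which would otherwise inject a large error into the adaptation loop.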

Implementation of an Integer-Based Adaptive Noise Canceller in Stratix Devices (1) -- Software Simulation

The sampled desired signal, composed of both the speaker’s speech and the vacuum noise, serves as the noise canceller’s reference signal; a second, separately sampled vacuum noise serves as the filter’s primary input signal. As the filter tap weights adapt, the vacuum noise is reduced, and the error signal produced by the adaptive system closely resembles the original speech.

-- Software Simulation Results

Since quantization of the primary and reference signals is unavoidable due to A/D conversion, the only source of error the designer can control is the product quantization noise at both the convolution stage and the adaptation stage.

-- Hardware Implementation

The newest FPGA families, Altera’s Stratix device family for example, incorporate embedded DSP blocks within the FPGA chip, providing dedicated circuitry for common DSP operations including multiply and accumulate. This family of FPGA devices is compared with another family of FPGA devices that does not include embedded DSP blocks. DSP applications, including adaptive systems, have traditionally been implemented using general-purpose DSP processors due to their ability to perform fast arithmetic operations.

-- Hardware Implementation

[Figure: hardware block diagram]

Conclusion (1) -- Stratix vs. Traditional FPGAs

Speed and area comparison
• Stratix: with on-chip DSP components
• APEX: traditional FPGA without DSP components

Conclusion (2) -- FPGAs vs. DSP Processors

Power consumption comparison
• Stratix: with on-chip DSP components
• TMS320VC33, DSP56390: traditional DSP devices

Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic

Ali M. Al-Haj, Department of Electronics Engineering, Princess Sumaya University for Technology, Al-Jubeiha P.O. Box 1438, Amman 11941, Jordan, 2003

Problem Statement and Purpose of the Design

• Programming multiprocessor systems is a tedious, difficult, and time-consuming task.
• Multiprocessor implementations of the discrete wavelet transform are not cost-effective, since parallelism comes at the expense of augmenting the system with more processing engines operating in parallel.
• Custom VLSI circuits are inherently inflexible, and their development is costly and time-consuming, so they are not an attractive option for implementing the wavelet transform.
• FPGAs maintain the advantages of the custom functionality of VLSI ASIC devices, while avoiding the high development costs and the inability to make design modifications after production. Furthermore, FPGAs inherit the design flexibility and adaptability of software implementations.
• Our discrete wavelet transform implementation exploits the natural match between the Virtex architecture and distributed arithmetic.

Basic Wavelet Computation

[Figure: system diagram and wavelet coefficients]

Distributed Arithmetic & Virtex FPGAs (1) -- Distributed Arithmetic

Let the variable Y hold the result of an inner product operation between a data vector x and a coefficient vector a. The conventional representation of the inner product operation is given as follows:

Here the input data words xi are represented in 2’s complement in order to bound number growth under multiplication. The variable xij is the jth bit of the word xi and is Boolean, B is the number of bits in each input data word, and xi0 is the sign bit.
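The equation referred to above did not survive in the text. The standard distributed arithmetic formulation that matches these definitions is sketched below. The vector length N is a symbol introduced here for illustration, and the sign bit is written x_{i0} (the slide text writes it as x0i) to match the index order of x_{ij}.

```latex
Y = \sum_{i=1}^{N} a_i x_i ,
\qquad
x_i = -x_{i0} + \sum_{j=1}^{B-1} x_{ij} \, 2^{-j}

Y = -\sum_{i=1}^{N} a_i x_{i0}
    + \sum_{j=1}^{B-1} \left[ \sum_{i=1}^{N} a_i x_{ij} \right] 2^{-j}
```

Each bracketed sum depends only on one bit from every input word, so it can take just 2^N values and can be read from a precomputed lookup table instead of being computed with multipliers.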

Distributed Arithmetic & Virtex FPGAs (1) -- Distributed Arithmetic

[Figure: distributed arithmetic implemented in FPGA]

Distributed arithmetic implementation

[Figure: distributed arithmetic filter implemented in FPGA]
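A software model may help in reading the distributed arithmetic figures. The sketch below is an assumption-laden illustration rather than the paper's design: it uses signed integers with integer bit weights 2^j (instead of fractional weights), a single lookup table addressed by one bit from every sample, and hypothetical function names.

```python
def build_da_lut(coeffs):
    """LUT entry k holds the sum of the coefficients whose bit is set in k."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (k >> i) & 1)
            for k in range(1 << n)]


def da_inner_product(lut, samples, bits):
    """Compute sum(coeffs[i] * samples[i]) one bit plane at a time.

    Samples are signed integers that fit in `bits`-bit two's complement.
    """
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into a LUT address.
        address = 0
        for i, s in enumerate(samples):
            address |= ((s >> j) & 1) << i
        term = lut[address] << j
        # The most significant bit has negative weight in two's complement.
        acc += -term if j == bits - 1 else term
    return acc
```

No multiplier appears anywhere in the loop: each step is a table lookup, a shift, and an add or subtract, which is the property that suits LUT-based FPGA fabric.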

Functional simulation

[Figure: forward and inverse DWT function simulation]

Performance evaluation (1)

[Figure: speed comparison between the conventional arithmetic implementation and the distributed arithmetic implementation]

Performance evaluation (2)

[Figure: resource usage comparison between the conventional arithmetic implementation and the distributed arithmetic implementation]

Conclusion and Further Work (1) -- Conclusions

• Four implementations were compared: two using the highly parallel Virtex field programmable gate array (FPGA) devices, and two in software, one on the TMS320C6711 digital signal processor and the other on an 800 MHz Intel Pentium III processor.
• The implementation based on the distributed arithmetic algorithm achieved the best performance results.
• The two software implementations were far inferior to the FPGA implementations in terms of execution speed.
• The TMS320C6711 digital signal processor performed much better than the Pentium III; however, its performance is still much lower than that of the least efficient, direct FPGA implementation.
• Using FPGAs, coupled with reformulating the computation of the wavelet transform in accordance with the distributed arithmetic algorithm, results in the performance levels required for real-time implementations.

Conclusion and Further Work (2) -- Further Work

• After completing this FPGA implementation of the discrete wavelet transform and its inverse, we are now working on integrating a whole wavelet-based image compression system on a single, dynamic, runtime reconfigurable FPGA.
• A typical image compression system consists of an encoder and a decoder. At the encoder side, an image is first transformed to the frequency domain using the forward discrete wavelet transform.
• The non-negligible wavelet coefficients are then quantized, and finally encoded using an appropriate entropy encoder. The decoder side reverses the whole encoding procedure described above.
• Transforming the 2-D image data can be done simply by inserting a matrix transpose module between two 1-D discrete wavelet transform modules such as those described in this paper.
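The last point, building a 2-D transform from two 1-D transforms and a transpose, can be sketched as follows. The sketch assumes a one-level, unnormalised Haar filter as a stand-in for the 1-D module (the wavelet actually used is not stated here) and rows of even length; the final transpose back to the original orientation is an addition for readability, and all names are hypothetical.

```python
def haar_1d(row):
    """One level of an unnormalised Haar transform: averages, then differences."""
    half = len(row) // 2
    low = [(row[2 * k] + row[2 * k + 1]) / 2 for k in range(half)]
    high = [(row[2 * k] - row[2 * k + 1]) / 2 for k in range(half)]
    return low + high


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def dwt_2d(image):
    """Row transform, transpose, row transform again, transpose back."""
    rows_done = [haar_1d(r) for r in image]
    cols_done = [haar_1d(r) for r in transpose(rows_done)]
    return transpose(cols_done)
```

The same 1-D module is reused for both passes; only the transpose changes which dimension it sees.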

Image Compression using Wavelet Filter Bank on Reconfigurable Computing System

Oliver Liu

ENGG*6090: Project of Reconfigurable Computing Systems, Winter 2007

Outline

• Problem Statement and Purpose of the Design
• Experiment Environment
• Transform and Coding Algorithms
• Software Implementation
• SW/HW Implementation (ongoing)
• Hardware Implementation (ongoing)
• Results
• Conclusion

Problem Statement and Purpose of the Design (1) -- Introduction

• A typical image compression system consists of an encoder and a decoder. At the encoder side, an image is first transformed to the frequency domain using the forward discrete wavelet transform.
• The non-negligible wavelet coefficients are then quantized, and finally encoded using an appropriate entropy encoder. The decoder side reverses the whole encoding procedure described above.
• An image compression system will be implemented on a reconfigurable computing platform.

Problem Statement and Purpose of the Design (2) -- System Diagram

[Figure: system diagram. Encoder: a forward wavelet filter bank produces an HP sub-image, which goes to Huffman coding, and an LP sub-image, which goes to run-length coding. Decoder: Huffman decoding and run-length decoding recover the HP and LP sub-images, which feed the backward wavelet filter bank.]

Problem Statement and Purpose of the Design (3) -- Problem Definition

• One implementation performs the transform, quantization, and coding all in software, running on a microprocessor on an FPGA.
• Other implementations move one or all of the transform, quantization, and coding stages to hardware, with the rest running on a microprocessor on the FPGA.
• An RTOS will be used to observe the performance of the different implementations, controlled by multiple processes.

Xilinx Multimedia Board

• The on-board Xilinx Virtex-II xc2v200 is used to implement the different architectures.
• The on-board external 2M memory will be used to store the original image and the compressed and decompressed images.
• The MFS file system is used to store image files.
• The Xilinx real-time operating system kernel, xilkernel, is used in this design.

Transform and Coding Algorithms (1) -- Wavelet Filter Bank

[Figure: system diagram and wavelet coefficients]

Transform and Coding Algorithms (2) -- Huffman Coding

Source  Probability  Code  Length
a1      0.20         11    2
a2      0.19         10    2
a3      0.18         011   3
a4      0.17         010   3
a5      0.15         001   3
a6      0.10         0001  4
a7      0.01         0000  4

[Figure: Huffman code tree with branches labelled 0 and 1]
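The code lengths on this slide can be checked with a small Huffman construction. The sketch below is an illustration under stated assumptions, not the project's coder: it takes the seven probabilities listed on the slide in descending order for a1 to a7, scales them to integer percentages to avoid floating-point ties, and computes only the code lengths. The particular 0/1 labels a coder assigns may differ from those on the slide while still being optimal.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Return {symbol: code length} for a dict of positive weights."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in weights}
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for sym in group1 + group2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths


# Probabilities from the slide, scaled to integer percentages.
weights = {"a1": 20, "a2": 19, "a3": 18, "a4": 17, "a5": 15, "a6": 10, "a7": 1}
lengths = huffman_code_lengths(weights)
average_bits = sum(weights[s] * lengths[s] for s in weights) / 100
```

With these probabilities the average code length works out to 2.72 bits per symbol, compared with 3 bits for a fixed-length code over seven symbols.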

Transform and Coding Algorithms (3) -- Run-Length Coding (RLC)

Consider a run of 15 'A' characters, which would normally require 15 bytes to store:

AAAAAAAAAAAAAAA

With run-length encoding (RLE), this requires only two bytes: the count (15) is stored as the first byte and the symbol (A) as the second byte.

15A
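The example above can be reproduced with a minimal run-length coder. This is a sketch under stated assumptions: runs are kept as (count, symbol) pairs rather than packed bytes, and the count is capped at 255 on the assumption that it must fit in one byte; the function names are hypothetical.

```python
def rle_encode(text):
    """Encode a string as a list of (count, symbol) pairs."""
    runs = []
    for ch in text:
        # Extend the current run unless the symbol changes or the
        # one-byte count is already full.
        if runs and runs[-1][1] == ch and runs[-1][0] < 255:
            runs[-1] = (runs[-1][0] + 1, ch)
        else:
            runs.append((1, ch))
    return runs


def rle_decode(runs):
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(ch * n for n, ch in runs)
```

Run-length coding only pays off when long runs are common; on data with no repeats, every symbol costs two bytes instead of one.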

Software Implementation

Hardware Implementation

Thank You

Questions?