Digital Signal Processing on Reconfigurable Computing Systems
Oliver Liu
ENGG*6090: Reconfigurable Computing Systems
Winter 2007

References
• Maya B. Gokhale, Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays, Chapter 5.
• C. Maxfield, The Design Warrior's Guide to FPGAs, Chapter 12.
• Andrew Y. Lin, Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design. Master's Thesis, University of Florida, 2003.
• Ali M. Al-Haj, Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic. Department of Electronics Engineering, Princess Sumaya University for Technology, Al-Jubeiha P.O. Box 1438, Amman 11941, Jordan, 2003.

Introduction
Why Use Reconfigurable Computing for DSP?
• Advantages and disadvantages of RC for DSP.
• Explorations in parallel DSP processing in FPGAs.
Some Basic DSP Application Building Blocks
• MAC: multiply-accumulate unit for DSP.
• Bit-serial adder, parallel distributed arithmetic multiplier.
• DSP components.
Some FPGA-Centric DSP Design Tools
• Assembly, C/C++, Handel-C, RTL, Xilinx Core Generator, MATLAB/Simulink, Xilinx System Generator

Advantages and Disadvantages of RC for DSP (1)
Technology | Performance | Cost   | Power      | Flexibility | Memory BW | I/O BW
GPP        | Low         | Low    | High       | High        | Low       | Low
PDSP       | Med-High    | Medium | Medium     | Medium      | Medium    | Low
ASIC       | High        | High   | Low        | Low         | High      | High
FPGA       | Medium      | Low    | Low-Medium | High        | High      | High

Advantages and Disadvantages of RC for DSP (2)
Advantages
• Parallel processing capability achieves high performance.
• A flexible architecture reduces the risk of product development; the design can be changed during the evolution of the product.
• Word widths can be flexible.
• Lower power than a DSP processor.
• Price is becoming lower.
Disadvantages
• Power consumption is higher and performance is lower than with an ASIC.

Explorations in Parallel DSP Processing in Reconfigurable Computing Systems (1)
[Figure: two filter structures, each running from Data In to Data Out with coefficients a(0) to a(3); one uses registers Reg0 to Reg3, the other adds further pipeline registers]

Explorations in Parallel DSP Processing in Reconfigurable Computing Systems (2)
• Most DSP applications require several operations, such as FIR filters and transforms, to process each incoming data stream, which provides the potential to exploit coarse-grained parallelism in an FPGA.
• DSP applications often use fixed coefficients or constants. By "folding" the constants directly into hardware, i.e., customizing the hardware for a given constant, the area and speed of operations can be significantly improved.
• Reconfigurable computing's ability to supply both flexible and significant memory bandwidth also increases the parallelism that can be extracted in DSP applications.
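
One way to picture constant folding is a multiplier that is specialized for a single known coefficient, so that it reduces to a fixed set of shifts and adds. The sketch below is a software illustration only; the function names are hypothetical and it assumes a non-negative integer constant.

```python
def shift_add_terms(constant: int) -> list[int]:
    """Return the bit positions that are set in a non-negative constant."""
    return [k for k in range(constant.bit_length()) if (constant >> k) & 1]


def multiply_by_constant(x: int, constant: int) -> int:
    """Multiply x by a fixed constant using only shifts and adds.

    In hardware, each shift is just wiring, so only the adders remain.
    """
    return sum(x << k for k in shift_add_terms(constant))


# Multiplying by 10 (binary 1010) needs two shifted copies and one adder.
print(shift_add_terms(10))           # bit positions 1 and 3
print(multiply_by_constant(13, 10))  # same value as 13 * 10
```

A constant with few set bits needs few adders, which is where the area and speed savings come from.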

Some DSP Application Building Blocks (1)
The most commonly used DSP functions are:
• FIR (finite impulse response) filters
• IIR (infinite impulse response) filters
• FFT (fast Fourier transform)
• DCT (discrete cosine transform)
• Encoder/decoder and error correction/detection functions
All of these blocks perform intensive arithmetic operations such as add, subtract, multiply, multiply-add, or multiply-accumulate.

Some DSP Application Building Blocks (2)
[Figure: MAC unit, in which a multiplier with inputs A[n:0] and B[n:0] and output Y[(2n - 1):0] feeds an adder and an accumulator; bit-serial adder unit with inputs A and B, output Sum, and a D flip-flop with Clk and Clr]
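
The two units can be modeled behaviorally as follows. This is a minimal sketch, assuming unsigned operands for the bit-serial adder and a carry flip-flop that is cleared before the first bit; the function names are hypothetical.

```python
def mac(a_values, b_values):
    """Multiply-accumulate: one multiply and one add per pair of inputs."""
    acc = 0  # accumulator register
    for a, b in zip(a_values, b_values):
        acc += a * b
    return acc


def bit_serial_add(a: int, b: int, width: int) -> int:
    """Add two unsigned words one bit per clock, least significant bit first."""
    carry = 0  # models the D flip-flop, cleared (Clr) before the first bit
    result = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= sum_bit << i
    return result  # final carry is dropped, so the sum is modulo 2**width


print(mac([1, 2, 3], [4, 5, 6]))
print(bit_serial_add(5, 6, 4))
```

The bit-serial adder trades time for area: it uses one full adder and one flip-flop regardless of word width, but takes `width` clock cycles per addition.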

Some DSP Application Building Blocks (3)
[Figure: 8-bit by 8-bit parallel distributed arithmetic multiplier. Input[7:0] is split into Input[7:4] and Input[3:0], which address (Addr[3:0]) two 16x12-bit ROMs; the ROM outputs UPP[11:0] and LPP[11:0] feed a 12-bit adder that produces Sum, with LPP[3:0] labeled separately]
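
A behavioral sketch of this multiplier is given below. It assumes that each ROM entry k holds the fixed 8-bit constant times k, that the upper partial product is weighted by 16, and that the low four bits of LPP bypass the adder; these details are inferred from the figure labels and are not stated on the slide. Function names are hypothetical.

```python
def build_rom(constant: int) -> list[int]:
    """16-entry ROM: entry k holds constant * k.

    For an 8-bit constant the largest entry is 255 * 15 = 3825, which fits
    in 12 bits, matching the 16x12-bit ROM size.
    """
    assert 0 <= constant < 256
    return [constant * k for k in range(16)]


def da_multiply(rom: list[int], value: int) -> int:
    """Multiply an unsigned 8-bit value by the ROM's constant."""
    assert 0 <= value < 256
    upp = rom[(value >> 4) & 0xF]  # upper partial product, from Input[7:4]
    lpp = rom[value & 0xF]         # lower partial product, from Input[3:0]
    high = upp + (lpp >> 4)        # the 12-bit adder: UPP + LPP[11:4]
    return (high << 4) | (lpp & 0xF)  # LPP[3:0] passes straight to the output


rom = build_rom(201)
print(da_multiply(rom, 77), 201 * 77)
```

No hardware multiplier is used: the two ROM lookups and one adder replace it, which is the appeal of the approach on LUT-based FPGAs.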

Some DSP Application Building Blocks (4)
• Efficient memory structures (LUTs)
• Filters: IIR, FIR, LMS, etc.
• Fast Fourier transform (FFT)
• Discrete cosine transform (DCT)
• Discrete wavelet transform (DWT)

Some FPGA-Centric DSP Design Tools and Languages
• Assembly, C, C++
• VHDL/Verilog (RTL code)
• Xilinx EDK
• Xilinx ISE
• Mentor Graphics ModelSim
• Xilinx Core Generator
• MATLAB/Simulink
• Xilinx System Generator

Topics Covered
• Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design
• Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic
• ENGG*6090 Project Status: Image Compression using Wavelet Filter Bank on Reconfigurable Computing System

Implementation Consideration for FPGA-Based Adaptive Transversal Filter Design
Andrew Y. Lin
Master's Thesis, University of Florida, 2003

Problem Statement and Purpose of the Design
Because digital hardware has finite precision, quantization must be performed in any or all of the following areas:
• Input and reference signals;
• Product quantization in the convolution stage;
• Coefficient quantization in the adaptation stage.
Quantization noise is introduced in all of these areas, and the thesis discusses its effects. The thesis also compares the performance of FPGAs and DSP processors in terms of speed and power consumption.

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (1) -- Rounding Effects
E is the expectation of the rounding error
X is the error caused by rounding
q is the quantizing step
P is the probability density function (pdf)
σ is the power spectral density of the rounding error

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (2) -- Truncation Effects
E is the expectation of the truncation error
X is the error caused by truncation
q is the quantizing step
P is the probability density function (pdf)
σ is the power spectral density of the truncation error
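
As a hedged numerical illustration of the quantities listed above, the sketch below assumes the standard textbook model in which the quantization error is uniformly distributed over one step q. Under that model, rounding gives an error mean of 0, truncation toward negative infinity gives a mean of -q/2, and both have a variance of q^2/12. The function names and the sweep-based estimate are illustrative assumptions, not taken from the thesis.

```python
import math


def quantization_errors(q, mode, samples_per_step=1000, steps=4):
    """Sweep evenly across several quantization steps and collect the errors."""
    errors = []
    for k in range(samples_per_step * steps):
        x = (k + 0.5) * q / samples_per_step
        if mode == "round":
            xq = q * math.floor(x / q + 0.5)  # round to nearest level
        else:
            xq = q * math.floor(x / q)        # truncate toward -infinity
        errors.append(xq - x)
    return errors


def mean_and_variance(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


q = 0.25
print("rounding:  ", mean_and_variance(quantization_errors(q, "round")))
print("truncation:", mean_and_variance(quantization_errors(q, "truncate")))
print("q^2/12 =   ", q * q / 12)
```

The two schemes add the same noise power, but truncation also adds a bias, which matters in an adaptive filter because the bias accumulates in the coefficient updates.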

Adaptive Algorithms, Finite Precision Effects on Adaptive Algorithms (3) -- Rounding Effects on the LMS Filter
Input Quantization Effects (A/D)
• ε(nT) is the quantization noise
Arithmetic Rounding Effects
• Product rounding effects
• Coefficient rounding effects
• Rounding effects at the adaptation stage
• Rounding effects at the convolution stage
LMS Filter Slowdown and Stalling
Saturation
• With the clamping technique, when saturation is detected the result is "clamped" to the most positive or most negative number, depending on the sign bit.
• Alternatively, the sign algorithm is another way to reduce or avoid stalling.
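
The clamping technique can be sketched as below, contrasted with the two's complement wraparound that occurs when overflow is left unhandled. This is an illustrative model with hypothetical function names, assuming signed two's complement words of a given bit width.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp a result to the most positive or most negative representable number."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, value))


def wrap(value: int, bits: int) -> int:
    """Two's complement wraparound: what the hardware yields without clamping."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


# A 16-bit accumulator asked to hold 40000:
print(saturate(40000, 16))  # stays at the largest positive value
print(wrap(40000, 16))      # flips sign, a far larger error
```

Clamping keeps the error bounded and preserves the sign of the result, whereas wraparound can invert it.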

Implementation of an Integer Based Adaptive Noise Canceller in Stratix Devices (1) -- Software Simulation
The sampled desired signal, composed of both the speaker's speech and the vacuum noise, serves as the noise canceller's reference signal; a second sampled vacuum-noise signal serves as the filter's primary input signal. During processing, the vacuum noise is reduced as the filter tap weights adapt, and the error signal produced by the adaptive system closely resembles the original speech.
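
The behavior described here can be reproduced with a small floating-point LMS sketch. This is not the thesis's integer implementation: the synthetic signals, the two-tap noise path, the tap count, and the step size `mu` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def lms_noise_canceller(desired, noise_input, taps=4, mu=0.01):
    """Return the error signal e = desired - filter output for each sample."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent noise samples, newest first
    errors = []
    for d, x in zip(desired, noise_input):
        history = [x] + history[:-1]
        y = sum(w * h for w, h in zip(weights, history))  # convolution stage
        e = d - y
        weights = [w + mu * e * h for w, h in zip(weights, history)]  # adaptation stage
        errors.append(e)
    return errors


def demo(n=5000, seed=0):
    """Compare noise power before and after cancellation over the last 1000 samples."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    speech = [0.5 * math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
    desired = [speech[k] + 0.6 * noise[k] + 0.3 * (noise[k - 1] if k else 0.0)
               for k in range(n)]
    errors = lms_noise_canceller(desired, noise)
    tail = range(n - 1000, n)
    residual = sum((errors[k] - speech[k]) ** 2 for k in tail) / 1000
    original = sum((desired[k] - speech[k]) ** 2 for k in tail) / 1000
    return residual, original


residual, original = demo()
print(f"noise power before: {original:.4f}, after: {residual:.4f}")
```

After the weights converge, the error signal tracks the speech component, because the filter can only model the part of the desired signal that is correlated with the noise input.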

Implementation of an Integer Based Adaptive Noise Canceller in Stratix Devices (2) -- Software Simulation Results
Since quantization of the primary and reference signals is unavoidable due to A/D conversion, the only source of error the designer can control is the product quantization noise at the convolution stage and the adaptation stage.

Implementation of an Integer Based Adaptive Noise Canceller in Stratix Devices (3) -- Hardware Implementation
The newest FPGA families, Altera's Stratix device family for example, incorporate embedded DSP blocks within the FPGA chip, providing dedicated circuitry for common DSP operations including multiply and accumulate. This family of FPGA devices is compared with another family that does not include embedded DSP blocks. DSP applications, including adaptive systems, have traditionally been implemented on general-purpose DSP processors because of their ability to perform fast arithmetic operations.

Implementation of an Integer Based Adaptive Noise Canceller in Stratix Devices (4) -- Hardware Implementation
[Figure: hardware block diagram]

Conclusion (1) -- Stratix vs. Traditional FPGAs
Speed and Area Comparison
• Stratix: with on-chip DSP components
• APEX: traditional FPGA without DSP components

Conclusion (2) -- FPGAs vs. DSP Processors
Power Consumption Comparison
• Stratix: with on-chip DSP components
• TMS320VC33, DSP56390: traditional DSP devices

Fast Discrete Wavelet Transformation Using FPGAs and Distributed Arithmetic
Ali M. Al-Haj
Department of Electronics Engineering,
Princess Sumaya University for Technology,
Al-Jubeiha P.O. Box 1438,
Amman 11941, Jordan, 2003

Problem Statement and Purpose of the Design
• Programming multiprocessor systems is a tedious, difficult, and time-consuming task.
• Multiprocessor implementations of the discrete wavelet transform are not cost effective, since parallelism comes at the expense of augmenting the system with more processing engines operating in parallel.
• Custom VLSI circuits are inherently inflexible and their development is costly and time consuming, so they are not an attractive option for implementing the wavelet transform.
• FPGAs maintain the advantages of the custom functionality of VLSI ASIC devices while avoiding the high development costs and the inability to make design modifications after production. Furthermore, FPGAs inherit the design flexibility and adaptability of software implementations.
• The discrete wavelet transform implementation in the paper exploits the natural match between the Virtex architecture and distributed arithmetic.

Basic Wavelet Computation
[Figure: system diagram and wavelet coefficients]

Distributed Arithmetic & Virtex FPGAs (1) -- Distributed Arithmetic
Let the variable Y hold the result of an inner product operation between a data vector x and a coefficient vector a. The conventional representation of the inner product operation is:
$$Y = \sum_{i} a_i x_i$$
Each input data word is written in 2's complement form:
$$x_i = -x_{i0} + \sum_{j=1}^{B-1} x_{ij} 2^{-j}$$
where the input data words x_i are represented in 2's complement in order to bound number growth under multiplication. The variable x_{ij} is the jth bit of the word x_i and is Boolean, B is the number of bits of each input data word, and x_{i0} is the sign bit.
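
The idea behind distributed arithmetic is to swap the order of summation: for each bit position, the same-position bits of all inputs form an address into a table of precomputed coefficient sums, and the table outputs are accumulated with the appropriate bit weight. The sketch below is a software model under stated assumptions: it uses integer-scaled 2's complement words (weights 2^j rather than the fractional 2^-j), and the function names are hypothetical.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_inner_product(coeffs, xs, bits):
    """Compute sum(a_i * x_i) one bit plane at a time, with no multipliers."""
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]  # 2's complement bit patterns
    acc = 0
    for j in range(bits):
        addr = 0
        for i, word in enumerate(words):
            addr |= ((word >> j) & 1) << i  # jth bit of every input word
        partial = lut[addr] << j
        # The sign-bit plane carries negative weight in 2's complement.
        acc += -partial if j == bits - 1 else partial
    return acc


coeffs = [3, -1, 4, 2]
xs = [5, -7, 2, -8]
print(da_inner_product(coeffs, xs, bits=4), sum(a * x for a, x in zip(coeffs, xs)))
```

The lookup table has 2^N entries for N coefficients, which maps naturally onto the small LUTs of an FPGA when N is small.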

Distributed Arithmetic & Virtex FPGAs (2) -- Distributed Arithmetic
[Figure: distributed arithmetic implemented in an FPGA]

Distributed Arithmetic Implementation
[Figure: distributed arithmetic filter implemented in an FPGA]

Functional Simulation
[Figure: forward and inverse DWT functional simulation]

Performance Evaluation (1)
[Figure: speed comparison between the conventional arithmetic implementation and the distributed arithmetic implementation]

Performance Evaluation (2)
[Figure: resource usage comparison between the conventional arithmetic implementation and the distributed arithmetic implementation]

Conclusion and Further Work (1) -- Conclusions
• Four implementations were compared: two using the highly parallel Virtex field programmable gate array (FPGA) devices, and two software implementations, one on the TMS320C6711 digital signal processor and the other on an 800 MHz Intel Pentium III processor.
• The FPGA implementation based on the distributed arithmetic algorithm achieved the best performance.
• The two software implementations were far inferior to the FPGA implementations in execution speed.
• The TMS320C6711 digital signal processor performed much better than the Pentium III; however, its performance was still much lower than that of the least efficient, direct FPGA implementation.
• Using FPGAs, coupled with reformulating the computation of the wavelet transform in accordance with the distributed arithmetic algorithm, yields the performance levels required for real-time implementations.

Conclusion and Further Work (2) -- Further Work
• After completing this FPGA implementation of the discrete wavelet transform and its inverse, the authors are working on integrating a whole wavelet-based image compression system on a single, dynamic, runtime-reconfigurable FPGA.
• A typical image compression system consists of an encoder and a decoder. At the encoder side, an image is first transformed to the frequency domain using the forward discrete wavelet transform.
• The non-negligible wavelet coefficients are then quantized and finally encoded using an appropriate entropy encoder. The decoder side reverses the whole encoding procedure.
• Transforming the 2-D image data can be done simply by inserting a matrix transpose module between two 1-D discrete wavelet transform modules such as those described in the paper.
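
The transpose-based 2-D transform can be sketched as follows. The one-level Haar averaging and differencing filter is an assumption chosen for brevity (the wavelet used in the paper is not named in this text), and the final transpose, which restores the original orientation, is an addition for readability. Function names are hypothetical.

```python
def haar_1d(row):
    """One level of a 1-D Haar transform: averages first, then differences."""
    half = len(row) // 2
    low = [(row[2 * k] + row[2 * k + 1]) / 2 for k in range(half)]
    high = [(row[2 * k] - row[2 * k + 1]) / 2 for k in range(half)]
    return low + high


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dwt_2d(image):
    """1-D transform on rows, transpose, 1-D transform again, transpose back."""
    rows_done = [haar_1d(row) for row in image]
    columns_done = [haar_1d(column) for column in transpose(rows_done)]
    return transpose(columns_done)


for row in dwt_2d([[1, 2], [3, 4]]):
    print(row)
```

Because both passes use the same row-oriented module, the hardware needs only one 1-D DWT design plus a transpose buffer.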

Image Compression using Wavelet Filter Bank on Reconfigurable Computing System
Oliver Liu
ENGG*6090: Project of Reconfigurable Computing Systems
Winter 2007

Outline
• Problem Statement and Purpose of the Design
• Experiment Environment
• Transform and Coding Algorithms
• Software Implementation
• SW/HW Implementation (ongoing)
• Hardware Implementation (ongoing)
• Results
• Conclusion

Problem Statement and Purpose of the Design (1) -- Introduction
• A typical image compression system consists of an encoder and a decoder. At the encoder side, an image is first transformed to the frequency domain using the forward discrete wavelet transform.
• The non-negligible wavelet coefficients are then quantized and finally encoded using an appropriate entropy encoder. The decoder side reverses the whole encoding procedure.
• An image compression system will be implemented on a reconfigurable computing platform.

Problem Statement and Purpose of the Design (2) -- System Diagram
[Figure: system diagram. The forward wavelet filter bank produces an HP sub-image, which goes through Huffman coding and Huffman decoding, and an LP sub-image, which goes through run-length coding and run-length decoding; the decoded HP and LP sub-images feed the backward wavelet filter bank]

Problem Statement and Purpose of the Design (3) -- Problem Definition
• One implementation performs the transform, quantization, and coding all in software, running on a microprocessor on an FPGA.
• Other implementations move one or all of the transform, quantization, and coding stages to hardware, with the rest running on a microprocessor on the FPGA.
• An RTOS will be used to observe the performance of the different implementations controlled by multiple processes.

Xilinx Multimedia Board
• The on-board Xilinx Virtex-II xc2v200 is used to implement the different architectures.
• The on-board external 2M memory is used to store the original, compressed, and decompressed images.
• The MFS file system is used to store image files.
• The Xilinx real-time operating system kernel, xilkernel, is used in this design.

Transform and Coding Algorithms (1) -- Wavelet Filter Bank
[Figure: system diagram and wavelet coefficients]

Transform and Coding Algorithms (2) -- Huffman Coding
Source | Probability | Code | Length
a1     | 0.20        | 11   | 2
a2     | 0.19        | 10   | 2
a3     | 0.18        | 011  | 3
a4     | 0.17        | 010  | 3
a5     | 0.15        | 001  | 3
a6     | 0.10        | 0001 | 4
a7     | 0.01        | 0000 | 4
[Figure: Huffman code tree with branches labeled 0 and 1]
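
The code lengths for this seven-symbol source can be checked with a short Huffman construction. The sketch below takes the seven probabilities from the slide in descending order for a1 to a7, which is an assumption about how the table columns line up; the function name is hypothetical. It computes only code lengths, since the exact bit patterns depend on how the 0 and 1 branches are labeled.

```python
import heapq
from itertools import count


def huffman_code_lengths(probabilities):
    """Return {symbol: code length} by repeatedly merging the two least likely nodes."""
    tie_breaker = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tie_breaker), [symbol]) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    lengths = {symbol: 0 for symbol in probabilities}
    while len(heap) > 1:
        p1, _, symbols1 = heapq.heappop(heap)
        p2, _, symbols2 = heapq.heappop(heap)
        for symbol in symbols1 + symbols2:
            lengths[symbol] += 1  # every merge adds one bit to these codes
        heapq.heappush(heap, (p1 + p2, next(tie_breaker), symbols1 + symbols2))
    return lengths


source = {"a1": 0.20, "a2": 0.19, "a3": 0.18, "a4": 0.17,
          "a5": 0.15, "a6": 0.10, "a7": 0.01}
lengths = huffman_code_lengths(source)
average = sum(source[s] * lengths[s] for s in source)
print(lengths)
print(f"average code length: {average:.2f} bits per symbol")
```

A fixed-length code for seven symbols would need 3 bits per symbol, so the variable-length code saves bits whenever the average falls below that.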

Transform and Coding Algorithms (3) -- Run-Length Coding (RLC)
Consider a run of 15 'A' characters, which would normally require 15 bytes to store:
AAAAAAAAAAAAAAA
With run-length coding, this requires only two bytes: the count (15) is stored as the first byte and the symbol (A) as the second byte.
15A
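
A minimal run-length coder along these lines is sketched below. It represents each run as a (count, symbol) pair rather than packed bytes, which is a simplification; a byte-oriented version would also need to split runs longer than 255. Function names are hypothetical.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (count, symbol) pair."""
    return [(len(list(group)), symbol) for symbol, group in groupby(text)]


def rle_decode(pairs):
    """Expand (count, symbol) pairs back into the original string."""
    return "".join(symbol * run_length for run_length, symbol in pairs)


encoded = rle_encode("A" * 15)
print(encoded)
print(rle_decode(encoded))
```

Run-length coding pays off only when long runs are common, which is why it suits sub-images dominated by repeated values.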

Software Implementation

Hardware Implementation

Thank You
Questions?