CHAPTER-4 Coprocessor based test case and discussion of results
In this chapter we begin to integrate solutions for SEU, MBU and SCA applied to FPGAs, and test them not just on small circuits but at the component and SoC level. A Variable Length Precision Arithmetic co-processor with the capability to change its precision through reconfiguration, based on application requirements, is discussed. An FPU co-processor is then designed and interfaced through the AMBA bus to a LEON3 core, with multiple co-processor interfaces, to create a system-level test pad. A configurable and scalable multi-processor system is demonstrated on a Xilinx FPGA with and without fault-tolerance and security features. The performance of single-processor and dual-processor implementations of the widely used DIT and DIF radix-2 N-point FFT algorithms on that system is analyzed to test the practical implementation issues of the “unified scalable” fault-tolerant and security architecture proposed as Re-PAM-DSP in this thesis.
CHAPTER FOUR: Coprocessor based test cases and discussion of results
4.1 Introduction
The need for low-power, high-performance computation without the extremely high design costs of ASICs has driven a number of designers to create flexible, universal platforms. Since the non-recurring engineering (NRE) costs of a rad-hard ASIC are many orders of magnitude larger than fabrication costs, users prefer to configure or program a flexible computing platform to run their applications with the desired performance, especially for the extremely low-volume, or even unique, mission-critical applications of each satellite.
A better alternative for high-bandwidth computation is to use FPGAs as coprocessors [217] that absorb the repetitive, speed-critical portions of the algorithms. In many applications the data arrives at high rates, such as 50 MSPS, while some sensor nodes deliver rates as low as 50 KSPS; either way, the processing chain is complex. DSPs often lack the bandwidth necessary to meet real-time deadlines, forcing the processing to be dealt with offline.
The difficulty with DSPs in these applications is that DSPs are essentially serial machines; in some high-speed DSPs specific instructions are processed in parallel, optimized through software and the special hardware available in a specific architecture. The result is machine-dependent code that cannot be reused on other platforms.
These issues can be addressed by exploiting the FPGA's ability to convert a cascaded set of operations into a parallel structure that operates over several clock cycles at high frequency. In most DSP applications, designers have few options to increase performance beyond optimizing with specialized DSP assembly instructions or upgrading the DSP processor. With the FPGA approach, the hardware as well as the software can be optimized simultaneously; moreover, the designer can change the partitioning of the system, modify the algorithm, update the hardware at runtime for different functionalities, and control the hardware of the system.
Another aspect is that the processing elements of FPGAs are used not just for prototyping custom hardware modules but also for parallel processing, by implementing multiple processors for a single task or for multiple tasks. Multi-processor hardware accelerators make it possible to distribute the tasks of an application across several microprocessors, exploiting parallelism to accelerate computation; this is especially valuable in image-processing flows, where computational performance is crucial to meeting real-time requirements. DSP applications are thus a promising area for exploiting co-processors.
Here three different types of co-processor platforms are addressed for performance improvement:
1. A Variable Length Precision Arithmetic co-processor is designed, with the capability to change precision and functionality using partial run-time reconfiguration.
2. An FPU co-processor is designed and interfaced via direct coupling, and also through the AMBA bus for a multiple-processor interface.
3. Performance through a multiprocessor platform: the Xilinx MicroBlaze soft-core processor is used to demonstrate parallel processing. The processors are interfaced in a linear array using FSL (Fast Simplex Link). FFT algorithms are executed on the processor cores as a benchmark and the speedup is analyzed.
4.2 Detailed Problem Definition
FPGAs are a way out for low-volume custom designs, common in mission-critical applications including space, which has led to custom solutions where the VIRTEX series FPGAs from Xilinx are quite popular. But there is a need for unified, platform-independent FPGA architectures that integrate fault tolerance, reliability and robust security, not just the base design. The issue extends to finding a viable means of testing the integration of various piecemeal solutions into a singular whole.
A large body of published research [3-6, 8-9, 14, 26-27] evaluates secure and fault-tolerant FPGA architectures using small circuit blocks such as shift registers, small memory arrays or combinatorial cells as test cases. The robustness of such FPGA architectures needs to be tested on larger test cases that include ALUs, MACs and medium-sized co-processor implementations. Selecting small test cases makes sense for fault-tolerance solutions that rely on clever circuit insertions, or on creating the correct hooks for detecting and correcting SEU faults. But it is a limiting scenario that does not test the scalability of a proposal across sizes and types of applications, and it leaves an intangible but real danger: difficulty in integrating the proposed solutions at the system level. A robust solution needs to be seamlessly viable from the smallest level of circuit detail up to the large system level, and yet remain compatible with the practical design phases of many types of applications in industry.
Significant literature has been published over the years to resolve the drawbacks of each generation of FPGA. Prominent among these is the proposal, and eventual deployment, of an Internal Configuration Access Port (ICAP) [54, 57, 58] for scrubbing SEU errors. Although ICAP has many advantages, its greatest limitation is that it is proprietary to the FPGA solutions rolled out by Xilinx. Any serious proposal requires a comprehensive test of scalability in application size, application type and across proprietary FPGA platforms, while remaining modular enough, when integrating all the features, to let the designer choose the level of fault tolerance and security.
4.3 Method of Solution
Testing of the Re-PAM-DSP architecture proposal had to cover design and validation experience for a real-life application. The endeavour is to integrate solutions for SEU, MBU and SCA applied to FPGAs, and test them not just on small circuits but at the component and SoC level, following through to the co-processor/accelerator interface with the SoC. A Variable Length Precision Arithmetic co-processor with the capability to change its precision through partial reconfiguration, based on application requirements, is discussed, and a scalable design is proposed to improve performance by incorporating multiple co-processors. An FPU co-processor is then designed and interfaced through the AMBA bus to a LEON3 core, with multiple co-processor interfaces, to create a system-level test pad. A configurable and scalable multi-processor system is demonstrated on a Xilinx FPGA with and without fault-tolerance and security features. The performance of single-processor and dual-processor implementations of the widely used DIT and DIF radix-2 N-point FFT algorithms on that system is analyzed to test the practical implementation issues of the “unified scalable” fault-tolerant and security architecture proposed as Re-PAM-DSP in this thesis. The results are discussed with reference to speedup, area, power and reconfiguration time for the co-processor/accelerator based designs. The design includes the processor block, system bus blocks and co-processor architectures. Software was written to demonstrate the designed system on an FFT application.
4.4 Reconfigurable Coprocessor Architecture (OVLP)
Online (serial) arithmetic algorithms [164, 168, 170] allow larger and variable-length operands in computation. The reconfigurable arithmetic coprocessor is interfaced with the host processor for Online Variable Length Precision (OVLP) tasks, taking over the repetitive, time-consuming computations of multiplication, division and square root. The basic logic resides in the static region, while the coprocessor's arithmetic algorithms reside in the dynamic region, which can be partially reconfigured at run time [74]. Since the emphasis is on arithmetic operations, this approach is called the Partial Reconfigurable Arithmetic Coprocessor (PRAC). More details of OVLP precision arithmetic are presented in Appendix E.
The implementation of the OVLP-based coprocessor has the basic building blocks required for any computing solution: control, data path and memory (M) blocks, as shown in Figure 4-1.
The host enables and directly initializes the local memory, which simplifies setup of the OVLP algorithms [163].
Figure 4-1 Coprocessor Organisation
4.4.1 Data Path
Figure 4-2 shows the execution unit implementing the OVLP algorithms for multiplication, division and square root. Components and interconnections drawn in bold lines belong to the static part, while those in dotted lines belong to the dynamic parts used in the reconfiguration module.
Figure 4-2 Execution Unit
For implementation, the design is divided into two components: a) the static portion and b) the partially reconfigurable portion. Maximum effort has been made to keep the major portion of the co-processor in the static part, since the reconfigurable part then uses the least amount of hardware and requires the least configuration time, the two most important parameters.
Figure 4-3 Hierarchy of PRAC (top module containing the static part, the shifter, and the multiply, division and square-root modules)
Control Block:
The control block takes care of data addressing and read/write control of the memory elements, extending to management of the data registers in the data path. When the control block is implemented in the dynamically reconfigurable area of the FPGA device, it can be reconfigured for the type of arithmetic operation being executed in the co-processor. The control block is divided into three parts:
Multiplication control
Division control
Square-root control
Since this is a partially reconfigurable region, it must be defined in a special manner. At any particular time, only one of the three control modules above is loaded in the circuit. To accommodate run-time reconfiguration, all parts need to follow some rules:
The entity declarations of all three parts must be identical.
This entity must be connected to the static part using bus macros.
A common component representing all three parts is used in the top module, so the same signals must be used by all three parts to connect to the static part.
Advantages of Partial Reconfiguration
The proposed architecture uses the reconfiguration technology of the FPGA, which offers the following features:
The data path can be adjusted depending on the hardware resources available in the reconfigurable coprocessor.
Digit-serial components are used to reduce broadcast signals, saving interconnect resources significantly.
Reconfiguration time is significantly reduced by using simple modifications of the data path to change from one arithmetic operation to another (as shown in Figure 4-2).
The ability to reconfigure even makes multiplexers on the data bus for output selection redundant, since the other functions are not present; the design thus avoids allocating resources that are not required for the operation being executed.
Because only part of the data path is reconfigured, the co-processor is faster during reconfiguration.
The reconfiguration time penalty may be further optimized by delegating control of operation transitions to the main processor:
1. Advanced detection of the next operation: the host processor, knowing that the coprocessor must be reconfigured, may trigger the reconfiguration phase in advance while other tasks execute on the host.
2. The host takes over the coprocessor's functions while the coprocessor is being reconfigured for the next operation.
Implementation Results:
Table 4-1 Device Utilization Data – Comparison between Reconfigurable Unit and Individual Units - 3 bit

Logic Utilization                Reconfigurable  Combined utilization  % saving in     Maximum size
                                 Unit            of individual units   hardware space  in the device
Number of Slice Registers        440             1048                  58.0            69120
Number of Slice LUTs             442             499                   11.4            69120
Number of fully used Bit Slices  325             442                   26.4            357
Number of bonded IOBs            598             357                   ----            640
Number of BUFG/BUFGCTRLs         1               1                     --              32
Table 4-2 Device Utilization Data – Comparison between Static and Reconfigurable Portions - 3 bit

Logic Utilization                Static   Reconfigurable  Reconfigurable portion  Maximum size
                                 portion  portion         as % of static portion  in the device
Number of Slice Registers        410      30              7.3                     69120
Number of Slice LUTs             379      63              16.6                    69120
Number of fully used Bit Slices  296      29              9.8                     357
Number of bonded IOBs            456      142             31.1                    640
Number of BUFG/BUFGCTRLs         1        1               --                      32
Table 4-3 Device Utilization Data – Comparison between Top Module with and without Bus-Macro - 3 bit

Logic Utilization                With       Without    % increase due  Maximum size
                                 bus-macro  bus-macro  to bus-macro    in the device
Number of Slice Registers        69         42         64              69120
Number of Slice LUTs             76         44         72              69120
Number of fully used Bit Slices  38         37         2.7             357
Number of bonded IOBs            22         22         0               640
Number of BUFG/BUFGCTRLs         1          1          --              32
Table 4-4 Device Utilization Data – Comparison between Reconfigurable Unit and Individual Units - 3 bit

Logic Utilization                Reconfigurable  Combined utilization  % saving in     Maximum size
                                 Unit            of individual units   hardware space  in the device
Number of Slice Registers        102             140                   27.1            69120
Number of Slice LUTs             101             132                   30.1            69120
Number of fully used Bit Slices  46              64                    28.2            357
Number of bonded IOBs            115             57                    ----            640
Number of BUFG/BUFGCTRLs         1               1                     --              32
Table 4-1 shows an interesting comparison between the hardware utilization of the reconfigurable region and that of all the individual regions combined. The reduction in hardware utilization is impressive, in some cases as high as 58%. Table 4-2 shows that the reconfigurable portion of the partially reconfigurable module is small, roughly 10-15% of the static portion, so the reconfiguration time is far shorter than for a total reconfiguration. Table 4-3 shows the additional hardware needed for the bus-macros between the static and reconfigurable modules. The impact of the bus-macros on the top layer is high, but comparing the partially reconfigurable unit against the individual units in Table 4-4, hardware space is still saved overall through partial reconfiguration; and for a 32-bit co-processor the saving in space is much higher than for the 3-bit co-processor.
4.4.2 Analysis of OVLP Implementation
The synthesis reports show a high degree of similarity among the online arithmetic algorithms, so 85-90% of the co-processor resides in the static part and only the remaining 10-15% needs to be reconfigured, which keeps reconfiguration time low. Compared with the total device utilization of the individual multiplication, division and square-root modules, the reconfigurable module saves about 20% of the hardware.
Here the arithmetic algorithms are implemented for radix 2, where only 1 bit is processed per clock cycle. If enough hardware is available, higher-radix algorithms can also be implemented, processing 2 (radix-4), 3 (radix-8) or 4 (radix-16) bits in a single clock cycle; this gives a speed improvement of 4x for the radix-4 algorithm. The implementation results give area-time estimates for the partially reconfigurable co-processor on a Xilinx Virtex-5 XC5VLX110T-FF1136-1 FPGA. The main building blocks used in the VLP (variable length precision) arithmetic circuits for radix-2 algorithms can easily be upgraded to higher-radix algorithms according to speed and hardware constraints.
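The per-cycle digit counts above can be checked with a small software model. The sketch below is a deliberate simplification: it uses plain shift-and-add rather than the actual signed-digit online recurrence, so it confirms only how the cycle count scales with bits consumed per cycle, not the online algorithm itself.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of digit-serial multiplication: one radix digit of the
 * multiplier b is consumed per modeled clock cycle.  This is plain
 * shift-and-add, NOT the signed-digit online recurrence, but the cycle
 * count for a given operand width scales the same way. */
static uint64_t serial_mul(uint32_t a, uint32_t b, unsigned bits,
                           unsigned bits_per_cycle, unsigned *cycles)
{
    uint64_t acc = 0;
    *cycles = 0;
    for (unsigned i = 0; i < bits; i += bits_per_cycle, (*cycles)++) {
        /* consume bits_per_cycle bits of b in this modeled cycle */
        uint32_t digit = (b >> i) & ((1u << bits_per_cycle) - 1u);
        acc += (uint64_t)digit * ((uint64_t)a << i);
    }
    return acc;
}
```

For 32-bit operands the radix-2 model (1 bit per cycle) takes 32 cycles, while the radix-4 model (2 bits per cycle) takes 16; the real online algorithms produce digits most-significant-first and carry their own per-cycle overheads.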
4.5 Multiple Coprocessor Interface and performance improvement
Here a multiple-coprocessor [46] platform is demonstrated using the open-source LEON3 processor. An IEEE 754 FPU is designed and interfaced to the AMBA bus through a purpose-built AMBA slave wrapper.
The configuration capabilities enable a user of the LEON3 processor system [17, 35, 42] to customize it for a particular application. The system peripherals are connected to the LEON3 core through the AMBA bus. This research work configured the core with separate instruction and data caches, each with its own controller, connected through an AMBA bus interface, an interrupt port, and a hardware divide and multiply unit. The integer unit (IU) and data path use a 7-stage pipeline, elaborated in the block diagram in Figure 4-4.
Figure 4-4 LEON3 core block diagram
4.5.1 Interfacing FPU on AHB
A 5-stage single-precision FPU [168] is designed and coded in VHDL and interfaced with the LEON3 core to create an SoC for evaluation against the existing IU, GRFPU and GRFPU-Lite. Multiple FPUs are connected to LEON3 through the AMBA AHB bus.
Challenges
As LEON3 implements the SPARC V8 [170, 173, 186] architecture, it has instructions for floating-point operations, defined as FPops (floating-point operate instructions). These do not perform loads and stores between memory and the FPU. The FPU has 32 32-bit floating-point registers. Data is moved between the FPU and memory by floating-point load/store instructions, for which the IU calculates the memory addresses, while the actual floating-point arithmetic is done by the FPop instructions. Gaisler Research provides GRFPU [187] and GRFPU-Lite [187], which can be integrated directly with the GRFPC, but these cores are available only as pre-compiled netlists (*.edf). Interfacing a custom FPU at the execution unit requires complex hardware description at the module level, and the IU pipeline needs to be stalled to match the port mapping of the FPU. So a new FPU is designed here, to be integrated with the LEON3 core on the AHB bus.
AMBA AHB
The AMBA AHB [163] implements the features required for high-performance, high-clock-frequency systems, including burst transfers, split transactions, single-cycle bus master handover, single-clock operation and a pipelined bus.
AHB Slave Wrapper
Wrappers for the custom IPs [163, 167] must be written to handle the different types of transfer between the bus and the custom IP: polling based, interrupt based and DMA based. Figure 4-5 shows a conceptual view of a wrapper for a custom IP.
Figure 4-5 Custom IP-Wrapper Logic Diagram
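To make the polling-based transfer type concrete, here is a hypothetical host-side driver sketch in C. The register offsets, control bits and names are illustrative assumptions, not the wrapper's actual register map, and the slave is emulated in software so the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register map for a polling-based FPU slave wrapper.
 * Offsets are illustrative only; the real wrapper defines its own map. */
enum { REG_OPA, REG_OPB, REG_CTRL, REG_STATUS, REG_RESULT, NUM_REGS };
enum { CTRL_ADD = 1, STATUS_DONE = 1 };

static uint32_t regs[NUM_REGS];   /* stand-in for the memory-mapped slave */

/* Emulate the slave: a write to CTRL starts the operation and, in this
 * zero-latency model, immediately raises the DONE flag. */
static void slave_write(unsigned off, uint32_t v)
{
    regs[off] = v;
    if (off == REG_CTRL && v == CTRL_ADD) {
        float a, b, r;
        memcpy(&a, &regs[REG_OPA], 4);
        memcpy(&b, &regs[REG_OPB], 4);
        r = a + b;
        memcpy(&regs[REG_RESULT], &r, 4);
        regs[REG_STATUS] = STATUS_DONE;
    }
}

/* Host-side polling sequence: write operands, start, poll, read back. */
static float fpu_add(float a, float b)
{
    uint32_t wa, wb;
    memcpy(&wa, &a, 4);
    memcpy(&wb, &b, 4);
    slave_write(REG_OPA, wa);
    slave_write(REG_OPB, wb);
    regs[REG_STATUS] = 0;
    slave_write(REG_CTRL, CTRL_ADD);
    while (!(regs[REG_STATUS] & STATUS_DONE))
        ;                         /* poll until the slave signals done */
    float r;
    memcpy(&r, &regs[REG_RESULT], 4);
    return r;
}
```

On the real hardware the `regs` array would be a `volatile` pointer into the FPU's AHB address window, and the polling loop is exactly where the AHB read/write latency discussed later enters the per-operation time.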
Register–FPU Mapping
The LEON3 [171, 175] is a master of the bus. The execution unit of LEON sends the necessary data to the FPU; the FPU operates on it and generates the output. LEON then issues a read request, and the output data of the particular operation becomes available to LEON. As the FPU cannot initiate a data transfer request on the AHB, it is an AHB slave. The system with the FPU as an AHB slave is shown in Figure 4-6.
The port map of this wrapper in the top-level design is as follows: the AHB index hindex is 3 for the FPU, and the AHB address field haddr for the FPU is 16#700#, meaning the address range assigned to the FPU starts at 0x70000000. The total size in memory is 1 MByte. It is cacheable and allows prefetch.
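The base address quoted above follows from the decode arithmetic: the 12-bit haddr field selects the top 12 bits of the 32-bit AHB address, so 16#700# maps to 0x70000000 and the minimum decode granularity is 2^20 bytes (1 MByte). A minimal sketch (ignoring the hmask widening that the plug-and-play scheme also supports) is:

```c
#include <assert.h>
#include <stdint.h>

/* The 12-bit haddr field occupies the top 12 bits of the 32-bit AHB
 * address, so the slave's base address is haddr << 20 and the smallest
 * decoded window is 1 MByte. */
static uint32_t ahb_base(uint32_t haddr) { return haddr << 20; }

/* An access hits the slave when its top 12 address bits equal haddr. */
static int in_window(uint32_t addr, uint32_t haddr)
{
    return (addr >> 20) == haddr;
}
```

With haddr = 0x700 this gives a base of 0x70000000; any access whose top 12 bits differ falls outside the FPU's window.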
Figure 4-6 AHB Slave Wrapper Block Diagram
Figure 4-7 Portmap of FPU Entity
4.5.2 Multiple FPU interface and results
Eight FPUs are instantiated to configure the LEON3-based SoC. The FPUs are IEEE 754 compliant. This is shown in Figure 4-8.
Figure 4-8 LEON3 based SoC having 8 FPU interfaced with AMBA AHB Bus
The SoC consisting of LEON3 with the FPU has been successfully designed in VHDL, simulated, verified and synthesized, and ported to the Avnet Xilinx XC4VLX60 board for FPGA prototyping. The timing comparison for the different designs, measured using a top-level C file, is shown in Table 4-5.
The designed FPU (referred to below as the Nirma FPU) provides a latency of 3.44 µs for any floating-point calculation. This is much higher than that of the licensed GRFPU and GRFPU-Lite cores; but if the floating-point calculations are executed by the integer unit, each operation takes on average around 14 µs. The speedup of GRFPU/GRFPU-Lite over the Nirma FPU is inevitable, since GRFPU is integrated directly in the LEON execution unit and so incurs no AHB read/write latencies. The floating-point speedup with our FPU is 4 times that of integer-unit-based floating-point computation using the software library. The Nirma FPU thus provides an acceptable trade-off between time to market, development cost and latency.
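The quoted 4x figure follows directly from the Table 4-5 latencies, assuming the software-library average of (13 + 11 + 16 + 16)/4 = 14 µs per operation:

```c
#include <assert.h>

/* Speedup of the AHB-attached FPU over software-library floating point,
 * computed from the Table 4-5 latencies (microseconds). */
static double speedup(double sw_us, double fpu_us)
{
    return sw_us / fpu_us;
}
```

14 µs divided by the 3.44 µs Nirma FPU latency gives roughly 4.07, hence "4 times".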
Table 4-5 Timing Analysis in µs on Avnet XC4VLX60 [43] (Fclock = 51 MHz)

FPU Operation   LEON3 + IU:        LEON3 +     LEON3 +          LEON3 + AHB
                FPU via software   GRFPU (µs)  GRFPU-Lite (µs)  FPU (Nirma) (µs)
                library (µs)
Addition        13                 0.649       0.688            3.44
Subtraction     11                 0.649       0.668            3.44
Multiplication  16                 0.629       0.688            3.44
Division        16                 0.885       0.962            3.44
Table 4-6 Device Utilization (%) Summary on Xilinx Avnet XC4VLX60

Design            LEON3    LEON3 +  LEON3 +     LEON3 +    LEON3 +
                  without  GRFPU    GRFPU-Lite  FPU (Our)  8 FPUs
                  FPU
Slices            44       66       45          44         56
Slice Flip Flops  11       16       12          11         14
4-Input LUTs      36       58       41          36         46
IO Blocks         29       29       29          29         29
FIFO16/RAM16      17       17       17          17         17
GCLKs             34       37       37          34         34
DCM_ADVs          50       50       50          50         50
DSP48s            12       31       6           12         12
As Table 4-6 shows, the device utilization for the LEON3 core with 8 FPUs represents only about a 10% increase in basic logic resources such as slices, flip-flops and LUTs.
The power comparison of the LEON3-based SoC with and without the FPU is shown in Table 4-7.
Table 4-7 Power Supply Summary for designs on the Avnet XC4VLX60 board

Supply Power (mW)  LEON3 without FPU  LEON3 + FPU  LEON3 + 8 FPUs
Total              2421.73            2348.54      2348.54
Dynamic            653.37             583.83       583.83
Quiescent          1761.36            1764.71      1764.71
There is a 3% rise in power consumption when 8 FPUs are configured with the LEON3 IP core through the AMBA [172] AHB bus.
Figure 4-9 Comparison of power consumption versus number of FPU at different
frequencies
Figure 4-9 shows the power consumption in mW of the FPUs operated at frequencies of 25, 50 and 100 MHz. There is very little variation in power consumption as the number of FPUs increases, but the power consumption doubles as the frequency doubles, i.e. it follows the dynamic power rule.
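That doubling is what the dynamic power relation P = αCV²f predicts, since only f changes between the runs. A sketch with placeholder α, C and V values (illustrative numbers, not measured board parameters):

```c
#include <assert.h>

/* Dynamic power P = alpha * C * V^2 * f is linear in clock frequency.
 * The alpha, C and V arguments here are placeholders, not measured
 * values for the XC4VLX60 board. */
static double p_dyn(double alpha, double cap, double vdd, double freq)
{
    return alpha * cap * vdd * vdd * freq;
}
```

Because α, C and V² cancel in the ratio, halving or doubling f scales the dynamic power by exactly the same factor, matching the trend in Figure 4-9.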
4.6 Multiple Processor for performance improvement with DSP algorithms
Here the Xilinx MicroBlaze [178] soft-core processor is used to create a platform demonstrating the multiprocessor approach to improving the performance of DSP algorithm computation. The approach is then analyzed by implementing the FFT algorithm on the platform, and the speedup is reported.
Figure 4-10 MicroBlaze core block diagram
The MicroBlaze soft core operates at 25-210 MHz. It is a 32-bit Harvard RISC architecture with a 32-bit LUT-RAM-based register file, a size-optimized (3-stage) or speed-optimized (5-stage) single-issue pipeline, and support for on-chip BRAM as well as external memory. Figure 4-10 shows its functional block diagram; the features shown on a shaded background, such as the Memory Management Unit (MMU), Floating Point Unit (FPU) and barrel shifter, are optional.
Multi-MicroBlaze system:
In a multi-MicroBlaze system, all processors share the same On-chip Peripheral Bus (OPB), clock generator, system reset and access to the IPs connected to the OPB, whereas each processor has its own BRAM, Instruction Local Memory Bus (ILMB), Data Local Memory Bus (DLMB), ILMB controller and DLMB controller. Inter-processor communication is achieved through the Fast Simplex Link (FSL), a 32-bit-wide unidirectional (master-slave) point-to-point link with a FIFO of up to 8K depth. Each MicroBlaze can have up to 8 input and 8 output FSL interfaces, and multiple MicroBlazes can be connected via FSLs in a variety of topologies such as star, ring, mesh and tree.
CoreMark Benchmark:
CoreMark [157] is a generic open-source benchmark that overcomes the main drawback of Dhrystone, namely that Dhrystone code is susceptible to smart compiler optimizations. CoreMark has three main algorithms. The first is linked-list manipulation, which deals with data access through pointers and hence exercises the memory units. The second is matrix manipulation, which involves serial data access (simple operations) and thus can exploit instruction-level parallelism. The third is a simple Moore state machine, which exercises the branch unit in the pipeline.
CoreMark evaluation of MicroBlaze:
At 100 MHz, MicroBlaze took a total of 10,842 ticks (10 ms per tick) for 10,000 iterations of the CoreMark code. Hence CoreMark (iterations/sec) = 92.23 and CoreMark/MHz = 0.9223. CoreMark states that, ignoring cache coherency, bus arbitration mechanisms and scheduler efficiency, the CoreMark/MHz score increases linearly with the number of processors in a multi-processor system, as shown in Figure 4-11; so for an N-processor system, CoreMark/MHz = 0.9223 × N.
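The quoted scores follow directly from the measured run; as a check:

```c
#include <assert.h>

/* CoreMark score from the measured run: iterations divided by elapsed
 * time, where elapsed time is ticks * seconds-per-tick. */
static double coremark(double iters, double ticks, double tick_s)
{
    return iters / (ticks * tick_s);   /* iterations per second */
}
```

10,000 iterations over 10,842 ticks of 10 ms gives 92.23 iterations/sec; dividing by the 100 MHz clock yields 0.9223 CoreMark/MHz, and the linear model then scales this by N.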
Figure 4-11 Linear speedup with increase in the number of processors.
4.6.1 System design for evaluation of FFT algorithms
Single MicroBlaze:
Figure 4-12 Single MicroBlaze implementation block diagram.
The system architecture of the single-MicroBlaze system is shown in Figure 4-12. The FFT algorithm [158, 160] is coded in C and stored in on-chip BRAM, which is connected to the MicroBlaze through the ILMB and DLMB. The On-chip Peripheral Bus (OPB) connects the UART and timer peripherals to the MicroBlaze. The UART output is redirected to the PC COM port and can be seen on the HyperTerminal screen.
Dual MicroBlaze:
The architecture of the dual-MicroBlaze system is shown in Figure 4-13. Each MicroBlaze has its own BRAM, ILMB and DLMB, and hence its own software executing in parallel. The connection between the two MicroBlaze processors is established via two FSL interfaces [159]. Both MicroBlazes share a common OPB connected to the UART and timer peripherals; the UART output can be seen on the HyperTerminal screen.
Figure 4-13 Dual MicroBlaze implementation block diagram.
4.6.2 Software implementation
Single MicroBlaze:
The entire application runs on top of the Xilinx Micro Kernel (XMK) components: the standard C libraries, the Standalone Board Support Package (BSP) and Xilkernel.
There are several options for implementing the FFT algorithm. The first is the recursive method, which directly reflects the mathematical derivation of the algorithm, but this approach has unacceptably high memory requirements for larger values of N; in Virtex-5 FPGAs the maximum available BRAM size is 256 KB. To counter this memory limitation, a recursive in-place algorithm could be employed, but deeply recursive routines have a higher chance of causing stack overflows.
Thus, a non-recursive in-place algorithm is used: a step-by-step imitation of the FFT flow graphs. The computations are done in place, eliminating the need for large temporary arrays. For the bit-reversal operation, observe that performing bit-reversal twice returns the original input; hence bit-reversing is achieved by merely swapping specific pairs of elements. Line-by-line profiling of the algorithm reveals that the majority of the clock cycles are spent calculating the cosine and sine functions needed for the twiddle factors. So, within a given stage of the log2 N stages, the cosine and sine are not recalculated each time; instead, their values are computed at the first occurrence and then updated by multiplying with an appropriate factor whose step size depends on the stage in which the computation occurs. One complex multiplication is, of course, much faster than one cosine and sine calculation. Furthermore, once a twiddle factor has been calculated, all the operations in the current stage that depend on it are performed immediately. These optimizations speed up the algorithm and make it more memory efficient.
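The bit-reversal-by-swapping observation can be sketched as a small C routine; the split real/imaginary float-array signature is an illustrative assumption, not the thesis code. Swapping only when i < j touches each pair exactly once, which is why applying the permutation twice restores the input.

```c
#include <assert.h>
#include <stddef.h>

/* In-place bit-reversal permutation used before (DIT) or after (DIF)
 * the FFT stages: element i is exchanged with the element whose index
 * is i with its log2(n) address bits reversed.  Instead of reversing
 * each index, j is advanced by "adding 1" MSB-first on every step. */
static void bit_reverse(float *re, float *im, size_t n)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (i < j) {              /* swap each pair only once */
            float t;
            t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
        /* increment j in reversed (MSB-first) bit order */
        size_t m = n >> 1;
        while (m && (j & m)) { j ^= m; m >>= 1; }
        j |= m;
    }
}
```

For n = 8 this swaps elements 1↔4 and 3↔6 and leaves the palindromic indices in place, matching the standard radix-2 reordering.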
Dual MicroBlaze:
In the dual-MicroBlaze system, the goal is to distribute the complexity of the algorithm across both processors as equally as possible. The bit-reversal is done on microblaze_0, before or after the FFT computation as the case may be (DIT or DIF) [213-215]. Of the log2 N stages, the first or the last stage (depending on whether it is DIF or DIT) is more economical to compute on a single processor (microblaze_0). The other stages involve calculations that are independent of each other, so the N-valued input is divided into two groups of N/2 values; one group is sent to microblaze_1 via FSL and the remaining log2 N - 1 stages are computed in parallel, i.e. in half the original time. Afterwards, the N/2 values are received back from microblaze_1.
Results and inferences:
Figure 4-14 Plot of number of clock cycles versus the values of N.
4.6.3 Analysis of Results
The results of the above implementations are summarized in Figure 4-14. In spite of the 256 KB BRAM limitation on Virtex-5, results for values of N up to 4096 are recorded, which shows how memory efficient the implementation is. All four plots are straight lines, and the DIT and DIF plots are very closely spaced. As expected, the slope of the line for the dual-processor system is much smaller than that of the single-processor system.
From the slope of a line we can calculate the number of processor clock cycles needed per unit increase in N. For the single-processor DIT system the value is approximately 20,000, which is remarkably low, a result of the various optimizations incorporated in the software implementation. This value falls further, to around 12,000, for the dual-processor system, as summarized in Table 4-8.
Table 4-8 Comparison of the number of clock cycles required for the FFT algorithm on single and dual MicroBlaze

Clock cycles / N   DIT     DIF
Single MicroBlaze  20,068  19,873
Dual MicroBlaze    12,061  12,061
Figure 4-15 Plot of speedup versus the values of N.
The ideal speedup for a dual-MicroBlaze system is 2. The practical speedup depends on the communication overhead of the FSLs and on the parts of the algorithm that execute serially (the bit-reversal and one stage of the N-point FFT). Speedup values are shown in Figure 4-15: for both DIT and DIF the speedup is 1.6 at reasonably high values of N. For lower values of N, the serially executing part takes more time than the parallel part, and the speedup is not fully realized.
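The 1.6 figure is consistent with a simple Amdahl-style model: if a fraction s of the work runs serially and the rest is split across the two cores, the speedup is 1 / (s + (1 - s)/2). The observed 1.6 corresponds to s = 0.25; that serial fraction is inferred here for illustration, not a measured value.

```c
#include <assert.h>

/* Amdahl-style speedup for two cores: a fraction serial_frac of the
 * work (bit-reversal plus one FFT stage) runs on one core, the rest
 * is halved by running on both cores. */
static double dual_speedup(double serial_frac)
{
    return 1.0 / (serial_frac + (1.0 - serial_frac) / 2.0);
}
```

With s = 0 the model recovers the ideal speedup of 2; with s = 0.25 it yields 1/(0.25 + 0.375) = 1.6, and for small N the serial fraction grows, explaining the reduced speedup at low N.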
Table 4-9 shows the device utilization of the Virtex-5 for the single-processor system. Clearly the Virtex-5 has extensive unutilized space to accommodate many more MicroBlaze instances. Moreover, custom hardware IPs such as a floating-point accelerator or a trigonometric accelerator could be added to achieve further speedup.
Table 4-9 Resource utilization for the single-processor system

                    Utilized  Available
Slice LUT-FF pairs  4,990     69,120
IOBs                4         640
DSP48Es             6         64
From the algorithm point of view, other algorithms such as the radix-4 FFT, the split-radix
FFT and the multidimensional FFT remain to be explored. Other DSP algorithms, such as
Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filters, can also
be implemented.
Another interesting direction is the parallelization of an application given a specific number
of processors and other hardware details. It involves the analysis of program hotspots,
call trees, data dependencies and so on.
4.7 Extensive test cases evaluation on SEU fault tolerant platform
The architecturally implemented Coprocessor, FPU, Multiprocessor System and
LEON-3 SoC are used as extensive test cases for the SEU fault tolerant system implemented
as the Re-PAM-DSP platform. The fault emulation results and their analysis are presented in
Table 4-10.
Table 4-10 Test case evaluation on SEU fault injection

Scenario                                    Coprocessor   FPU      Multiprocessor System   SoC
All injected random faults                  10,000        10,000   12,000                  12,000
Faults that affected the system             257           407      530                     612
Faults recovered using scrubbing            142           298      400                     434
Faults recovered using parallel scrubbing   245           388      499                     576
Unrecovered faults                          12            19       31                      36
Random single-bit-flip faults were injected using the SEU monitor system after configuring
the Re-PAM-DSP for a specific application. The number of injected faults varied
from 10,000 to 12,000 depending on the application ported to the platform. As an example,
consider the results for the SoC application: 12,000 faults were injected, of which 612
affected the operation of the application. Of these, 434 faults were detected and corrected
using the SEU macro controller mechanism; the remaining faults caused errors in the
functionality. The proposed approach of parallel scrubbing uses multiple
instantiations of the SEU macro (here, two SEU macros were used). With this technique, 576
faults were recovered and only 36 faults could not be recovered. Across all the other
applications, it is likewise observed that parallel scrubbing substantially reduces the
number of unrecovered faults. This number can be reduced further by increasing the
scrubbing rate and instantiating more SEU macros, at the cost of area
overhead. It is also observed that the more complex the
application, component or module, the more faults remain unrecovered.
4.8 Summary
In this chapter we have demonstrated multiple co-processor interfaces with the
processor through the AMBA bus. An AMBA slave wrapper [180-182] has been designed
which can be used for any generated slave IP that needs to interact with the main
processor through the AMBA bus. An IEEE 754 compliant FPU [183,188] has been designed
and used to demonstrate the multiple-coprocessor interface to the LEON3.
The designed FPU provides a latency of 3.44 µs for any floating-point operation, which is
almost four times the speed offered by the software floating-point library. As eight FPUs
are interfaced to the core through the AMBA AHB bus, eight floating-point operations can
be computed within 3.44 µs. There is an overhead of distributing the data, but it is
handled in the C software, and the improvement in performance remains substantial. In the
future, the SoC can be scaled up by configuring multiple LEON3 cores with multiple FPUs
attached to each core, improving performance many fold; the challenge will
be the distribution of data and the organization of results. Here the power consumption is
increased by only 3%, which shows that the multiple-FPU design consumes very little
power. As many as 16 coprocessors can be associated with a single LEON3.
For the VLP algorithm, it is observed that there is a high similarity among the on-line
arithmetic algorithms for the various arithmetic operations. About 85-90% of the co-
processor is common and can be loaded into the static part of the FPGA, while the
remaining 10-15% is dynamic with respect to functionality. This reduces the
reconfiguration time, as only 10-15% of the design needs to be reconfigured to change the
function of the coprocessor. Compared to the total device utilization of the individual
multiplication, division and square-root modules, the reconfigurable module saves 20% of
the hardware. Here the arithmetic algorithms are implemented for radix 2, where only one
bit is processed per clock cycle, but they can be scaled to higher-radix algorithms, which
process 2 (radix-4), 3 (radix-8), 4 (radix-16), etc., bits in a single clock cycle.
The multiprocessor-based system is demonstrated with two MicroBlaze processors directly
connected through a Fast Simplex Link. The DIT and DIF algorithms are written in C and
the load is distributed between the processors. The speedup for both DIT and DIF is 1.5-1.6
for N (the number of DFT points) above 512. For lower values of N, the serially executing
part takes more time than the parallel part and thus the speedup is not fully realized.
The architecturally implemented Coprocessor, FPU, Multiprocessor System and
LEON-3 SoC are used as extensive test cases for the SEU fault tolerant system implemented
as the Re-PAM-DSP platform. It is observed that the Re-PAM-DSP fault tolerant system
improves recovery from SEUs, and thus system reliability is improved for high-performance
DSP application requirements.