CHAPTER-4 Coprocessor based test case and discussion of results
In this chapter we begin to integrate solutions for SEU, MBU and SCA applied to FPGAs, and test them not just on small circuits but at the component and SoC level. A Variable Length Precision Arithmetic co-processor with the capability to change its precision through reconfiguration, based on application requirements, is discussed. An FPU co-processor is then designed and interfaced through the AMBA bus to a LEON3 core, with multiple co-processor interfaces, to create a system-level test pad. A configurable and scalable multi-processor system is demonstrated on a Xilinx FPGA with and without fault-tolerance and security features. The performance of single-processor and dual-processor implementations of the widely used DIT and DIF radix-2 N-point FFT algorithms on that system is analyzed to test the practical implementation issues of the “unified scalable” fault-tolerant and security architecture proposed as Re-PAM-DSP in this thesis.
CHAPTER FOUR: Coprocessor based test cases and discussion of results
4.1 Introduction
The need for low-power, high-performance computation without the extremely high design costs of ASICs has driven a number of designers to create flexible, universal platforms. Since the non-recurring engineering (NRE) costs of a rad-hard ASIC are many orders of magnitude larger than fabrication costs, users prefer to configure or program a flexible computing platform to run their applications with the desired performance, especially for the extremely low-volume, or even unique, mission-critical applications of each satellite.
A better alternative for high-bandwidth computation is to use FPGAs as coprocessors [217] that absorb the repetitive, speed-critical portions of the algorithms. In many applications the data arrives at high rates, such as 50 MSPS, while some sensor nodes deliver rates as low as 50 KSPS; either way, the processing chain is complex. DSPs often lack the bandwidth necessary to meet real-time deadlines, forcing the processing to be dealt with offline.
The difficulty with DSPs in these applications is that DSPs are essentially serial machines; in some high-speed DSPs specific instructions are processed in parallel, optimized through software and the special hardware available in a specific architecture. The result is machine-dependent code that cannot be reused on other platforms.
These issues can be addressed by exploiting the FPGA's ability to convert a cascaded set of operations into a parallel structure that operates over several clock cycles at high frequency. In most DSP applications, designers have few options to increase performance beyond optimizing with specialized DSP assembly instructions or upgrading the DSP processor. With the FPGA approach, the hardware as well as the software can be optimized simultaneously; moreover, the designer can change the partitioning of the system, modify the algorithm, update the hardware at runtime for different functionalities, and control the hardware of the system.
Another aspect is that the processing elements of FPGAs are used not just for prototyping custom hardware modules but also for parallel processing, by implementing multiple processors for a single task or for multiple tasks. Multi-processor hardware accelerators make it possible to distribute the tasks of an application across several microprocessors, exploiting parallelism to accelerate computation; this is especially valuable in image-processing flows, where computational performance is crucial to meeting real-time requirements. DSP applications are thus a promising area for exploiting co-processors.
Here three different types of co-processor platforms are addressed for performance improvement:
1. A Variable Length Precision Arithmetic co-processor is designed, with the capability to change precision and functionality using partial run-time reconfiguration.
2. An FPU co-processor is designed and interfaced via direct coupling, and also through the AMBA bus for a multiple-processor interface.
3. Performance through a multiprocessor platform: the Xilinx MicroBlaze soft-core processor is used to demonstrate parallel processing. The processors are interfaced in a linear array using FSL (Fast Simplex Link). FFT algorithms are executed on the processor cores as a benchmark and the speedup is analyzed.
4.2 Detailed Problem Definition
FPGAs are a way out for low-volume custom designs, common in mission-critical applications including space, which has led to custom solutions where the VIRTEX series FPGAs from Xilinx are quite popular. But there is a need for unified, platform-independent FPGA architectures that integrate fault tolerance, reliability and robust security, not just the base design. The issue extends to finding a viable means of testing the integration of various piecemeal solutions into a singular whole.
A large body of published research [3-6, 8-9, 14, 26-27] evaluates secure and fault-tolerant FPGA architectures using small circuit blocks such as shift registers, small memory arrays or combinatorial cells as test cases. The robustness of such FPGA architectures needs to be tested on larger test cases that include ALUs, MACs and medium-sized co-processor implementations. Selecting small test cases makes sense for fault-tolerance solutions that rely on clever circuit insertions, or on creating the correct hooks for detecting and correcting SEU faults. But it is a limiting scenario that does not test the scalability of a proposal across sizes and types of applications, and it leaves an intangible but real danger: difficulty in integrating the proposed solutions at the system level. A robust solution needs to be seamlessly viable from the smallest level of circuit detail up to the large system level, and yet remain compatible with the practical design phases of many types of applications in industry.
Significant literature has been published over the years to resolve the drawbacks of each generation of FPGA. Prominent among these is the proposal, and eventual deployment, of an Internal Configuration Access Port (ICAP) [54, 57, 58] for scrubbing SEU errors. Although ICAP has many advantages, its greatest limitation is that it is proprietary to the FPGA solutions rolled out by Xilinx. Any serious proposal requires a comprehensive test of scalability in application size, application type and across proprietary FPGA platforms, while remaining modular enough, when integrating all the features, to let the designer choose the level of fault tolerance and security.
4.3 Method of Solution
Testing of the Re-PAM-DSP architecture proposal had to cover design and validation experience for a real-life application. The endeavour is to integrate solutions for SEU, MBU and SCA applied to FPGAs, and test them not just on small circuits but at the component and SoC level, following through to the co-processor/accelerator interface with the SoC. A Variable Length Precision Arithmetic co-processor with the capability to change its precision through partial reconfiguration, based on application requirements, is discussed, and a scalable design is proposed to improve performance by incorporating multiple co-processors. An FPU co-processor is then designed and interfaced through the AMBA bus to a LEON3 core, with multiple co-processor interfaces, to create a system-level test pad. A configurable and scalable multi-processor system is demonstrated on a Xilinx FPGA with and without fault-tolerance and security features. The performance of single-processor and dual-processor implementations of the widely used DIT and DIF radix-2 N-point FFT algorithms on that system is analyzed to test the practical implementation issues of the “unified scalable” fault-tolerant and security architecture proposed as Re-PAM-DSP in this thesis. The results are discussed with reference to speedup, area, power and reconfiguration time for the co-processor/accelerator based designs. The design includes the processor block, system bus blocks and co-processor architectures. Software was written to demonstrate the designed system on an FFT application.
4.4 Reconfigurable Coprocessor Architecture (OVLP)
Online (serial) arithmetic algorithms [164, 168, 170] allow larger and variable-length operands in computation. The reconfigurable arithmetic coprocessor is interfaced with the host processor for Online Variable Length Precision (OVLP) tasks, taking over the repetitive, time-consuming computations of multiplication, division and square root. The basic logic resides in the static region, while the coprocessor's arithmetic algorithms reside in the dynamic region, which can be partially reconfigured at run time [74]. Since the emphasis is on arithmetic operations, this approach is called the Partial Reconfigurable Arithmetic Coprocessor (PRAC). More details of OVLP precision arithmetic are presented in Appendix E.
The implementation of the OVLP-based coprocessor has the basic building blocks required for any computing solution: control, data path and memory (M) blocks, as shown in Figure 4-1.
The host enables and directly initializes the local memory, which simplifies setup of the OVLP algorithms [163].
Figure 4-1 Coprocessor Organisation
4.4.1 Data Path
Figure 4-2 shows the execution unit implementing the OVLP algorithms for multiplication, division and square root. Components and interconnections drawn in bold lines belong to the static part, while those in dotted lines belong to the dynamic parts used in the reconfiguration module.
Figure 4-2 Execution Unit
For implementation, the design is divided into two components: a) the static portion and b) the partially reconfigurable portion. Maximum effort has been made to keep the major portion of the co-processor in the static part, since the reconfigurable part then uses the least amount of hardware and requires the least configuration time, the two most important parameters.
Figure 4-3 Hierarchy of PRAC (top module containing the static part, the shifter, and the multiply, division and square-root modules)
Control Block:
The control block takes care of data addressing and read/write control of the memory elements, extending to management of the data registers in the data path. When the control block is implemented in the dynamically reconfigurable area of the FPGA device, it can be reconfigured for the type of arithmetic operation being executed in the co-processor. The control block is divided into three parts:
Multiplication control
Division control
Square-root control
Since this is a partially reconfigurable region, it must be defined in a special manner. At any particular time, only one of the three control modules above is loaded in the circuit. To accommodate run-time reconfiguration, all parts need to follow some rules:
The entity declarations of all three parts must be identical.
This entity must be connected to the static part using bus macros.
A common component representing all three parts is used in the top module, so the same signals must be used by all three parts to connect to the static part.
Advantages of Partial Reconfiguration
The proposed architecture uses the reconfiguration technology of the FPGA, which offers the following features:
The data path can be adjusted depending on the hardware resources available in the reconfigurable coprocessor.
Digit-serial components are used to reduce broadcast signals, saving interconnect resources significantly.
Reconfiguration time is significantly reduced by using simple modifications of the data path to change from one arithmetic operation to another (as shown in Figure 4-2).
The ability to reconfigure even makes multiplexers on the data bus for output selection redundant, since the other functions are not present; the design thus avoids allocating resources that are not required for the operation being executed.
Because only part of the data path is reconfigured, the co-processor is faster during reconfiguration.
The reconfiguration time penalty may be further optimized by delegating control of operation transitions to the main processor:
1. Advanced detection of the next operation: the host processor, knowing that the coprocessor must be reconfigured, may trigger the reconfiguration phase in advance while other tasks execute on the host.
2. The host takes over the coprocessor's functions while the coprocessor is being reconfigured for the next operation.
Implementation Results:
Table 4-1 Device Utilization Data – Comparison between Reconfigurable Unit and Individual Units - 3 bit

Logic Utilization                Reconfigurable  Combined utilization  % saving in     Maximum size
                                 Unit            of individual units   hardware space  in the device
Number of Slice Registers        440             1048                  58.0            69120
Number of Slice LUTs             442             499                   11.4            69120
Number of fully used Bit Slices  325             442                   26.4            357
Number of bonded IOBs            598             357                   ----            640
Number of BUFG/BUFGCTRLs         1               1                     --              32
Table 4-2 Device Utilization Data – Comparison between Static and Reconfigurable Portions - 3 bit

Logic Utilization                Static   Reconfigurable  Reconfigurable portion  Maximum size
                                 portion  portion         as % of static portion  in the device
Number of Slice Registers        410      30              7.3                     69120
Number of Slice LUTs             379      63              16.6                    69120
Number of fully used Bit Slices  296      29              9.8                     357
Number of bonded IOBs            456      142             31.1                    640
Number of BUFG/BUFGCTRLs         1        1               --                      32
Table 4-3 Device Utilization Data – Comparison between Top Module with and without Bus-Macro - 3 bit

Logic Utilization                With       Without    % increase due  Maximum size
                                 bus-macro  bus-macro  to bus-macro    in the device
Number of Slice Registers        69         42         64              69120
Number of Slice LUTs             76         44         72              69120
Number of fully used Bit Slices  38         37         2.7             357
Number of bonded IOBs            22         22         0               640
Number of BUFG/BUFGCTRLs         1          1          --              32
Table 4-4 Device Utilization Data – Comparison between Reconfigurable Unit and Individual Units - 3 bit

Logic Utilization                Reconfigurable  Combined utilization  % saving in     Maximum size
                                 Unit            of individual units   hardware space  in the device
Number of Slice Registers        102             140                   27.1            69120
Number of Slice LUTs             101             132                   30.1            69120
Number of fully used Bit Slices  46              64                    28.2            357
Number of bonded IOBs            115             57                    ----            640
Number of BUFG/BUFGCTRLs         1               1                     --              32
Table 4-1 shows an interesting comparison between the hardware utilization of the reconfigurable region and that of all the individual regions combined. The reduction in hardware utilization is impressive, in some cases as high as 58%. Table 4-2 shows that the reconfigurable portion of the partially reconfigurable module is small, roughly 10-15% of the static portion, so the reconfiguration time is far shorter than for a total reconfiguration. Table 4-3 shows the additional hardware needed for the bus-macros between the static and reconfigurable modules. The impact of the bus-macros on the top layer is high, but comparing the partially reconfigurable unit against the individual units in Table 4-4, hardware space is still saved overall through partial reconfiguration; and for a 32-bit co-processor the saving in space is much higher than for the 3-bit co-processor.
4.4.2 Analysis of OVLP Implementation
The synthesis reports show a high degree of similarity among the online arithmetic algorithms, so 85-90% of the co-processor resides in the static part and only the remaining 10-15% needs to be reconfigured, which keeps reconfiguration time low. Compared with the total device utilization of the individual multiplication, division and square-root modules, the reconfigurable module saves about 20% of the hardware.
Here the arithmetic algorithms are implemented for radix 2, where only 1 bit is processed per clock cycle. If enough hardware is available, higher-radix algorithms can also be implemented, processing 2 (radix-4), 3 (radix-8) or 4 (radix-16) bits in a single clock cycle; this gives a speed improvement of 4x for the radix-4 algorithm. The implementation results give area-time estimates for the partially reconfigurable co-processor on a Xilinx Virtex-5 XC5VLX110T-FF1136-1 FPGA. The main building blocks used in the VLP (variable length precision) arithmetic circuits for radix-2 algorithms can easily be upgraded to higher-radix algorithms according to speed and hardware constraints.
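The per-cycle digit counts above can be checked with a small software model. The sketch below is a deliberate simplification: it uses plain shift-and-add rather than the actual signed-digit online recurrence, so it confirms only how the cycle count scales with bits consumed per cycle, not the online algorithm itself.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of digit-serial multiplication: one radix digit of the
 * multiplier b is consumed per modeled clock cycle.  This is plain
 * shift-and-add, NOT the signed-digit online recurrence, but the cycle
 * count for a given operand width scales the same way. */
static uint64_t serial_mul(uint32_t a, uint32_t b, unsigned bits,
                           unsigned bits_per_cycle, unsigned *cycles)
{
    uint64_t acc = 0;
    *cycles = 0;
    for (unsigned i = 0; i < bits; i += bits_per_cycle, (*cycles)++) {
        /* consume bits_per_cycle bits of b in this modeled cycle */
        uint32_t digit = (b >> i) & ((1u << bits_per_cycle) - 1u);
        acc += (uint64_t)digit * ((uint64_t)a << i);
    }
    return acc;
}
```

For 32-bit operands the radix-2 model (1 bit per cycle) takes 32 cycles, while the radix-4 model (2 bits per cycle) takes 16; the real online algorithms produce digits most-significant-first and carry their own per-cycle overheads.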
4.5 Multiple Coprocessor Interface and performance improvement
Here a multiple-coprocessor [46] platform is demonstrated using the open-source LEON3 processor. An IEEE 754 FPU is designed and interfaced to the AMBA bus through a purpose-built AMBA slave wrapper.
The configuration capabilities enable a user of the LEON3 processor system [17, 35, 42] to customize it for a particular application. The system peripherals are connected to the LEON3 core through the AMBA bus. This research work configured the core with separate instruction and data caches, each with its own controller, connected through an AMBA bus interface, an interrupt port, and a hardware divide and multiply unit. The integer unit (IU) and data path use a 7-stage pipeline, elaborated in the block diagram in Figure 4-4.
Figure 4-4 LEON3 core block diagram
4.5.1 Interfacing FPU on AHB
A 5-stage single-precision FPU [168] is designed and coded in VHDL and interfaced with the LEON3 core to create an SoC for evaluation against the existing IU, GRFPU and GRFPU-Lite. Multiple FPUs are connected to LEON3 through the AMBA AHB bus.
Challenges
As LEON3 implements the SPARC V8 [170, 173, 186] architecture, it has instructions for floating-point operations, defined as FPops (floating-point operate instructions). These do not perform loads and stores between memory and the FPU. The FPU has 32 32-bit floating-point registers. Data is moved between the FPU and memory by floating-point load/store instructions, for which the IU calculates the memory addresses, while the actual floating-point arithmetic is done by the FPop instructions. Gaisler Research provides GRFPU [187] and GRFPU-Lite [187], which can be integrated directly with the GRFPC, but these cores are available only as pre-compiled netlists (*.edf). Interfacing a custom FPU at the execution unit requires complex hardware description at the module level, and the IU pipeline needs to be stalled to match the port mapping of the FPU. So a new FPU is designed here, to be integrated with the LEON3 core on the AHB bus.
AMBA AHB
The AMBA AHB [163] implements the features required for high-performance, high-clock-frequency systems, including burst transfers, split transactions, single-cycle bus master handover, single-clock operation and a pipelined bus.
AHB Slave Wrapper
Wrappers for the custom IPs [163, 167] must be written to handle the different types of transfer between the bus and the custom IP: polling based, interrupt based and DMA based. Figure 4-5 shows a conceptual view of a wrapper for a custom IP.
Figure 4-5 Custom IP-Wrapper Logic Diagram
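To make the polling-based transfer type concrete, here is a hypothetical host-side driver sketch in C. The register offsets, control bits and names are illustrative assumptions, not the wrapper's actual register map, and the slave is emulated in software so the sketch is self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register map for a polling-based FPU slave wrapper.
 * Offsets are illustrative only; the real wrapper defines its own map. */
enum { REG_OPA, REG_OPB, REG_CTRL, REG_STATUS, REG_RESULT, NUM_REGS };
enum { CTRL_ADD = 1, STATUS_DONE = 1 };

static uint32_t regs[NUM_REGS];   /* stand-in for the memory-mapped slave */

/* Emulate the slave: a write to CTRL starts the operation and, in this
 * zero-latency model, immediately raises the DONE flag. */
static void slave_write(unsigned off, uint32_t v)
{
    regs[off] = v;
    if (off == REG_CTRL && v == CTRL_ADD) {
        float a, b, r;
        memcpy(&a, &regs[REG_OPA], 4);
        memcpy(&b, &regs[REG_OPB], 4);
        r = a + b;
        memcpy(&regs[REG_RESULT], &r, 4);
        regs[REG_STATUS] = STATUS_DONE;
    }
}

/* Host-side polling sequence: write operands, start, poll, read back. */
static float fpu_add(float a, float b)
{
    uint32_t wa, wb;
    memcpy(&wa, &a, 4);
    memcpy(&wb, &b, 4);
    slave_write(REG_OPA, wa);
    slave_write(REG_OPB, wb);
    regs[REG_STATUS] = 0;
    slave_write(REG_CTRL, CTRL_ADD);
    while (!(regs[REG_STATUS] & STATUS_DONE))
        ;                         /* poll until the slave signals done */
    float r;
    memcpy(&r, &regs[REG_RESULT], 4);
    return r;
}
```

On the real hardware the `regs` array would be a `volatile` pointer into the FPU's AHB address window, and the polling loop is exactly where the AHB read/write latency discussed later enters the per-operation time.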
Register–FPU Mapping
The LEON3 [171, 175] is a master of the bus. The execution unit of LEON sends the necessary data to the FPU; the FPU operates on it and generates the output. LEON then issues a read request, and the output data of the particular operation becomes available to LEON. As the FPU cannot initiate a data transfer request on the AHB, it is an AHB slave. The system with the FPU as an AHB slave is shown in Figure 4-6.
The port map of this wrapper in the top-level design is as follows: the AHB index hindex is 3 for the FPU, and the AHB address field haddr for the FPU is 16#700#, meaning the address range assigned to the FPU starts at 0x70000000. The total size in memory is 1 MByte. It is cacheable and allows prefetch.
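The base address quoted above follows from the decode arithmetic: the 12-bit haddr field selects the top 12 bits of the 32-bit AHB address, so 16#700# maps to 0x70000000 and the minimum decode granularity is 2^20 bytes (1 MByte). A minimal sketch (ignoring the hmask widening that the plug-and-play scheme also supports) is:

```c
#include <assert.h>
#include <stdint.h>

/* The 12-bit haddr field occupies the top 12 bits of the 32-bit AHB
 * address, so the slave's base address is haddr << 20 and the smallest
 * decoded window is 1 MByte. */
static uint32_t ahb_base(uint32_t haddr) { return haddr << 20; }

/* An access hits the slave when its top 12 address bits equal haddr. */
static int in_window(uint32_t addr, uint32_t haddr)
{
    return (addr >> 20) == haddr;
}
```

With haddr = 0x700 this gives a base of 0x70000000; any access whose top 12 bits differ falls outside the FPU's window.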
Figure 4-6 AHB Slave Wrapper Block Diagram
Figure 4-7 Portmap of FPU Entity
4.5.2 Multiple FPU interface and results
Eight FPUs are instantiated to configure the LEON3-based SoC. The FPUs are IEEE 754 compliant. This is shown in Figure 4-8.
Figure 4-8 LEON3 based SoC having 8 FPU interfaced with AMBA AHB Bus
The SoC consisting of LEON3 with the FPU has been successfully designed in VHDL, simulated, verified and synthesized, and ported to the Avnet Xilinx XC4VLX60 board for FPGA prototyping. The timing comparison for the different designs, measured using a top-level C file, is shown in Table 4-5.
The designed FPU (referred to below as the Nirma FPU) provides a latency of 3.44 µs for any floating-point calculation. This is much higher than that of the licensed GRFPU and GRFPU-Lite cores; but if the floating-point calculations are executed by the integer unit, each operation takes on average around 14 µs. The speedup of GRFPU/GRFPU-Lite over the Nirma FPU is inevitable, since GRFPU is integrated directly in the LEON execution unit and so incurs no AHB read/write latencies. The floating-point speedup with our FPU is 4 times that of integer-unit-based floating-point computation using the software library. The Nirma FPU thus provides an acceptable trade-off between time to market, development cost and latency.
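The quoted 4x figure follows directly from the Table 4-5 latencies, assuming the software-library average of (13 + 11 + 16 + 16)/4 = 14 µs per operation:

```c
#include <assert.h>

/* Speedup of the AHB-attached FPU over software-library floating point,
 * computed from the Table 4-5 latencies (microseconds). */
static double speedup(double sw_us, double fpu_us)
{
    return sw_us / fpu_us;
}
```

14 µs divided by the 3.44 µs Nirma FPU latency gives roughly 4.07, hence "4 times".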
Table 4-5 Timing Analysis in µs on Avnet XC4VLX60 [43] (Fclock = 51 MHz)

FPU Operation   LEON3 + IU:        LEON3 +     LEON3 +          LEON3 + AHB
                FPU via software   GRFPU (µs)  GRFPU-Lite (µs)  FPU (Nirma) (µs)
                library (µs)
Addition        13                 0.649       0.688            3.44
Subtraction     11                 0.649       0.668            3.44
Multiplication  16                 0.629       0.688            3.44
Division        16                 0.885       0.962            3.44
Table 4-6 Device Utilization (%) Summary on Xilinx Avnet XC4VLX60

Design            LEON3    LEON3 +  LEON3 +     LEON3 +    LEON3 +
                  without  GRFPU    GRFPU-Lite  FPU (Our)  8 FPUs
                  FPU
Slices            44       66       45          44         56
Slice Flip Flops  11       16       12          11         14
4-Input LUTs      36       58       41          36         46
IO Blocks         29       29       29          29         29
FIFO16/RAM16      17       17       17          17         17
GCLKs             34       37       37          34         34
DCM_ADVs          50       50       50          50         50
DSP48s            12       31       6           12         12
As Table 4-6 shows, the device utilization for the LEON3 core with 8 FPUs represents only about a 10% increase in basic logic resources such as slices, flip-flops and LUTs.
The power comparison of the LEON3-based SoC with and without the FPU is shown in Table 4-7.
Table 4-7 Power Supply Summary for designs on the Avnet XC4VLX60 board

Supply Power (mW)  LEON3 without FPU  LEON3 + FPU  LEON3 + 8 FPUs
Total              2421.73            2348.54      2348.54
Dynamic            653.37             583.83       583.83
Quiescent          1761.36            1764.71      1764.71
There is a 3% rise in power consumption when 8 FPUs are configured with the LEON3 IP core through the AMBA [172] AHB bus.
Figure 4-9 Comparison of power consumption versus number of FPU at different
frequencies
Figure 4-9 shows the power consumption in mW of the FPUs operated at frequencies of 25, 50 and 100 MHz. There is very little variation in power consumption as the number of FPUs increases, but the power consumption doubles as the frequency doubles, i.e. it follows the dynamic power rule.
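That doubling is what the dynamic power relation P = αCV²f predicts, since only f changes between the runs. A sketch with placeholder α, C and V values (illustrative numbers, not measured board parameters):

```c
#include <assert.h>

/* Dynamic power P = alpha * C * V^2 * f is linear in clock frequency.
 * The alpha, C and V arguments here are placeholders, not measured
 * values for the XC4VLX60 board. */
static double p_dyn(double alpha, double cap, double vdd, double freq)
{
    return alpha * cap * vdd * vdd * freq;
}
```

Because α, C and V² cancel in the ratio, halving or doubling f scales the dynamic power by exactly the same factor, matching the trend in Figure 4-9.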
4.6 Multiple Processor for performance improvement with DSP algorithms
Here the Xilinx MicroBlaze [178] soft-core processor is used to create a platform demonstrating the multiprocessor approach to improving the performance of DSP algorithm computation. The approach is then analyzed by implementing the FFT algorithm on the platform, and the speedup is reported.
Figure 4-10 MicroBlaze core block diagram
The MicroBlaze soft core operates at 25-210 MHz. It is a 32-bit Harvard RISC architecture with a 32-bit LUT-RAM-based register file, a size-optimized (3-stage) or speed-optimized (5-stage) single-issue pipeline, and support for on-chip BRAM as well as external memory. Figure 4-10 shows its functional block diagram; the features shown on a shaded background, such as the Memory Management Unit (MMU), Floating Point Unit (FPU) and barrel shifter, are optional.
Multi-MicroBlaze system:
In a multi-MicroBlaze system, all processors share the same On-chip Peripheral Bus (OPB), clock generator, system reset and access to the IPs connected to the OPB, whereas each processor has its own BRAM, Instruction Local Memory Bus (ILMB), Data Local Memory Bus (DLMB), ILMB controller and DLMB controller. Inter-processor communication is achieved through the Fast Simplex Link (FSL), a 32-bit-wide unidirectional (master-slave) point-to-point link with a FIFO of up to 8K depth. Each MicroBlaze can have up to 8 input and 8 output FSL interfaces, and multiple MicroBlazes can be connected via FSLs in a variety of topologies such as star, ring, mesh and tree.
CoreMark Benchmark:
CoreMark [157] is a generic open-source benchmark that overcomes the main drawback of Dhrystone, namely that Dhrystone code is susceptible to smart compiler optimizations. CoreMark has three main algorithms. The first is linked-list manipulation, which deals with data access through pointers and hence exercises the memory units. The second is matrix manipulation, which involves serial data access (simple operations) and thus can exploit instruction-level parallelism. The third is a simple Moore state machine, which exercises the branch unit in the pipeline.
CoreMark evaluation of MicroBlaze:
At 100 MHz, MicroBlaze took a total of 10,842 ticks (10 ms per tick) for 10,000 iterations of the CoreMark code. Hence CoreMark (iterations/sec) = 92.23 and CoreMark/MHz = 0.9223. CoreMark states that, ignoring cache coherency, bus arbitration mechanisms and scheduler efficiency, the CoreMark/MHz score increases linearly with the number of processors in a multi-processor system, as shown in Figure 4-11; so for an N-processor system, CoreMark/MHz = 0.9223 × N.
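The quoted scores follow directly from the measured run; as a check:

```c
#include <assert.h>

/* CoreMark score from the measured run: iterations divided by elapsed
 * time, where elapsed time is ticks * seconds-per-tick. */
static double coremark(double iters, double ticks, double tick_s)
{
    return iters / (ticks * tick_s);   /* iterations per second */
}
```

10,000 iterations over 10,842 ticks of 10 ms gives 92.23 iterations/sec; dividing by the 100 MHz clock yields 0.9223 CoreMark/MHz, and the linear model then scales this by N.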
Figure 4-11 Linear speedup with increase in the number of processors.
4.6.1 System design for evaluation of FFT algorithms
Single MicroBlaze:
Figure 4-12 Single MicroBlaze implementation block diagram.
The system architecture of the single-MicroBlaze system is shown in Figure 4-12. The FFT algorithm [158, 160] is coded in C and stored in on-chip BRAM, which is connected to the MicroBlaze through the ILMB and DLMB. The On-chip Peripheral Bus (OPB) connects the UART and timer peripherals to the MicroBlaze. The UART output is redirected to the PC COM port and can be seen on the HyperTerminal screen.
Dual MicroBlaze:
The architecture of the dual-MicroBlaze system is shown in Figure 4-13. Each MicroBlaze has its own BRAM, ILMB and DLMB, and hence its own software executing in parallel. The connection between the two MicroBlaze processors is established via two FSL interfaces [159]. Both MicroBlazes share a common OPB connected to the UART and timer peripherals; the UART output can be seen on the HyperTerminal screen.
Figure 4-13 Dual MicroBlaze implementation block diagram.
4.6.2 Software implementation
Single MicroBlaze:
The entire application runs on top of the Xilinx Micro Kernel (XMK) components: the standard C libraries, the Standalone Board Support Package (BSP) and Xilkernel.
There are several options for implementing the FFT algorithm. The first is the recursive method, which directly reflects the mathematical derivation of the algorithm, but this approach has unacceptably high memory requirements for larger values of N; in Virtex-5 FPGAs the maximum available BRAM size is 256 KB. To counter this memory limitation, a recursive in-place algorithm could be employed, but deeply recursive routines have a higher chance of causing stack overflows.
Thus, a non-recursive in-place algorithm is used: a step-by-step imitation of the FFT flow graphs. The computations are done in place, eliminating the need for large temporary arrays. For the bit-reversal operation, observe that performing bit-reversal twice returns the original input; hence bit-reversing is achieved by merely swapping specific pairs of elements. Line-by-line profiling of the algorithm reveals that the majority of the clock cycles are spent calculating the cosine and sine functions needed for the twiddle factors. So, within a given stage of the log2 N stages, the cosine and sine are not recalculated each time; instead, their values are computed at the first occurrence and then updated by multiplying with an appropriate factor whose step size depends on the stage in which the computation occurs. One complex multiplication is, of course, much faster than one cosine and sine calculation. Furthermore, once a twiddle factor has been calculated, all the operations in the current stage that depend on it are performed immediately. These optimizations speed up the algorithm and make it more memory efficient.
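The bit-reversal-by-swapping observation can be sketched as a small C routine; the split real/imaginary float-array signature is an illustrative assumption, not the thesis code. Swapping only when i < j touches each pair exactly once, which is why applying the permutation twice restores the input.

```c
#include <assert.h>
#include <stddef.h>

/* In-place bit-reversal permutation used before (DIT) or after (DIF)
 * the FFT stages: element i is exchanged with the element whose index
 * is i with its log2(n) address bits reversed.  Instead of reversing
 * each index, j is advanced by "adding 1" MSB-first on every step. */
static void bit_reverse(float *re, float *im, size_t n)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (i < j) {              /* swap each pair only once */
            float t;
            t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
        /* increment j in reversed (MSB-first) bit order */
        size_t m = n >> 1;
        while (m && (j & m)) { j ^= m; m >>= 1; }
        j |= m;
    }
}
```

For n = 8 this swaps elements 1↔4 and 3↔6 and leaves the palindromic indices in place, matching the standard radix-2 reordering.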
Dual MicroBlaze:
In the dual-MicroBlaze system, the goal is to distribute the complexity of the algorithm across both processors as equally as possible. The bit-reversal is done on microblaze_0, before or after the FFT computation as the case may be (DIT or DIF) [213-215]. Of the log2 N stages, the first or the last stage (depending on whether it is DIF or DIT) is more economical to compute on a single processor (microblaze_0). The other stages involve calculations that are independent of each other, so the N-valued input is divided into two groups of N/2 values; one group is sent to microblaze_1 via FSL and the remaining log2 N - 1 stages are computed in parallel, i.e. in half the original time. Afterwards, the N/2 values are received back from microblaze_1.
Results and inferences:
Figure 4-14 Plot of number of clock cycles versus the values of N.
4.6.3 Analysis of Results
The results of the above implementations are summarized in Figure 4-14. In spite of the 256 KB BRAM limitation on Virtex-5, results for values of N up to 4096 are recorded, which shows how memory efficient the implementation is. All four plots are straight lines, and the DIT and DIF plots are very closely spaced. As expected, the slope of the line for the dual-processor system is much smaller than that of the single-processor system.
From the slope of a line we can calculate the number of processor clock cycles needed per unit increase in N. For the single-processor DIT system the value is approximately 20,000, which is remarkably low, a result of the various optimizations incorporated in the software implementation. This value falls further, to around 12,000, for the dual-processor system, as summarized in Table 4-8.
Table 4-8 Comparison of the number of clock cycles required for the FFT algorithm on single and dual MicroBlaze

Clock cycles / N   DIT     DIF
Single MicroBlaze  20,068  19,873
Dual MicroBlaze    12,061  12,061
Figure 4-15 Plot of speedup versus the values of N.
The ideal speedup for a dual-MicroBlaze system is 2. The practical speedup depends on the communication overhead of the FSLs and on the parts of the algorithm that execute serially (the bit-reversal and one stage of the N-point FFT). Speedup values are shown in Figure 4-15: for both DIT and DIF the speedup is 1.6 at reasonably high values of N. For lower values of N, the serially executing part takes more time than the parallel part, and the speedup is not fully realized.
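The 1.6 figure is consistent with a simple Amdahl-style model: if a fraction s of the work runs serially and the rest is split across the two cores, the speedup is 1 / (s + (1 - s)/2). The observed 1.6 corresponds to s = 0.25; that serial fraction is inferred here for illustration, not a measured value.

```c
#include <assert.h>

/* Amdahl-style speedup for two cores: a fraction serial_frac of the
 * work (bit-reversal plus one FFT stage) runs on one core, the rest
 * is halved by running on both cores. */
static double dual_speedup(double serial_frac)
{
    return 1.0 / (serial_frac + (1.0 - serial_frac) / 2.0);
}
```

With s = 0 the model recovers the ideal speedup of 2; with s = 0.25 it yields 1/(0.25 + 0.375) = 1.6, and for small N the serial fraction grows, explaining the reduced speedup at low N.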
Table 4-9 shows the device utilization of the Virtex-5 for the single-processor system. Clearly the Virtex-5 has extensive unutilized space to accommodate many more MicroBlaze instances. Moreover, custom hardware IPs such as a floating-point accelerator or a trigonometric accelerator could be added to achieve further speedup.
Table 4-9 Resource utilization for the single-processor system

                    Utilized  Available
Slice LUT-FF pairs  4,990     69,120
IOBs                4         640
DSP48Es             6         64
From the algorithm point of view, other algorithms such as the radix-4 FFT, the split-radix
FFT and the multidimensional FFT remain to be explored. Other DSP algorithms, such as
Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filters, can also
be implemented.
Another interesting direction is the parallelization of an application given a specific number
of processors and other hardware details. It involves the analysis of program hotspots,
call trees, data dependencies and so on.
4.7 Extensive test cases evaluation on SEU fault tolerant platform
The architecturally implemented Coprocessor, FPU, Multiprocessor System and
LEON-3 SoC are used as extensive test cases for the SEU fault tolerant system implemented
as the Re-PAM-DSP platform. The fault emulation results and their analysis are presented in
Table 4-10.
Table 4-10 Test case evaluation on SEU fault injection

Scenario                                    Coprocessor   FPU      Multiprocessor System   SoC
All injected random faults                  10,000        10,000   12,000                  12,000
Faults that affected the system             257           407      530                     612
Faults recovered using scrubbing            142           298      400                     434
Faults recovered using parallel scrubbing   245           388      499                     576
Unrecovered faults                          12            19       31                      36
Random single-bit-flip faults were injected using the SEU monitor system after configuring
the Re-PAM-DSP for a specific application. The number of injected faults varied
from 10,000 to 12,000 depending on the application ported to the platform. As an example,
consider the results for the SoC application: 12,000 faults were injected, of which 612
affected the operation of the application. Of these, 434 faults were detected and corrected
using the SEU macro controller mechanism; the remaining faults caused errors in the
functionality. The proposed approach of parallel scrubbing uses multiple
instantiations of the SEU macro (here, two SEU macros were used). With this technique, 576
faults were recovered and only 36 faults could not be recovered. Across all the other
applications, it is likewise observed that parallel scrubbing substantially reduces the
number of unrecovered faults. This number can be reduced further by increasing the
scrubbing rate and instantiating more SEU macros, at the cost of area
overhead. It is also observed that the more complex the
application, component or module, the more faults remain unrecovered.
4.8 Summary
In this chapter we have demonstrated multiple co-processor interfaces with the
processor through the AMBA bus. An AMBA slave wrapper [180-182] has been designed
which can be used for any generated slave IP that needs to interact with the main
processor through the AMBA bus. An IEEE 754 compliant FPU [183,188] has been designed
and used to demonstrate the multiple-coprocessor interface to the LEON3.
The designed FPU provides a latency of 3.44 µs for any floating-point operation, which is
almost four times the speed offered by the software floating-point library. As eight FPUs
are interfaced to the core through the AMBA AHB bus, eight floating-point operations can
be computed within 3.44 µs. There is an overhead of distributing the data, but it is
handled in the C software, and the improvement in performance remains substantial. In the
future, the SoC can be scaled up by configuring multiple LEON3 cores with multiple FPUs
attached to each core, improving performance many fold; the challenge will
be the distribution of data and the organization of results. Here the power consumption is
increased by only 3%, which shows that the multiple-FPU design consumes very little
power. As many as 16 coprocessors can be associated with a single LEON3.
For the VLP algorithm, it is observed that there is a high similarity among the on-line
arithmetic algorithms for the various arithmetic operations. About 85-90% of the co-
processor is common and can be loaded into the static part of the FPGA, while the
remaining 10-15% is dynamic with respect to functionality. This reduces the
reconfiguration time, as only 10-15% of the design needs to be reconfigured to change the
function of the coprocessor. Compared to the total device utilization of the individual
multiplication, division and square-root modules, the reconfigurable module saves 20% of
the hardware. Here the arithmetic algorithms are implemented for radix 2, where only one
bit is processed per clock cycle, but they can be scaled to higher-radix algorithms, which
process 2 (radix-4), 3 (radix-8), 4 (radix-16), etc., bits in a single clock cycle.
The multiprocessor-based system is demonstrated with two MicroBlaze processors directly
connected through a Fast Simplex Link. The DIT and DIF algorithms are written in C and
the load is distributed between the processors. The speedup for both DIT and DIF is 1.5-1.6
for N (the number of DFT points) above 512. For lower values of N, the serially executing
part takes more time than the parallel part and thus the speedup is not fully realized.
The architecturally implemented Coprocessor, FPU, Multiprocessor System and
LEON-3 SoC are used as extensive test cases for the SEU fault tolerant system implemented
as the Re-PAM-DSP platform. It is observed that the Re-PAM-DSP fault tolerant system
improves recovery from SEUs, and thus system reliability is improved for high-performance
DSP application requirements.