DSP Hardware

EKT353 Lecture Notes by Professor Dr. Farid Ghani

    Introduction:

Since their introduction in the early 1980s, DSP processors have grown substantially in complexity and sophistication to enhance their capability and range of applicability. This has also led to a substantial increase in the number of DSP processors available. To reflect this, features of successive generations of fixed- and floating-point DSP processors and factors that affect the choice of DSP processors are considered in the following pages.

For convenience, DSP processors can be divided into two broad categories: general purpose and special purpose. General-purpose DSP processors include fixed-point devices such as the Texas Instruments TMS320C54x and Motorola DSP563x processors, and floating-point processors such as the Texas Instruments TMS320C4x and Analog Devices ADSP21xxx SHARC processors.

There are two types of special-purpose hardware:

1. Hardware designed for efficient execution of specific DSP algorithms, such as digital filters or the fast Fourier transform. This type of special-purpose hardware is sometimes called an algorithm-specific digital signal processor.

2. Hardware designed for specific applications: for example, telecommunications, digital audio, or control applications. This type of hardware is sometimes called an application-specific digital signal processor.

In most cases application-specific digital signal processors execute specific algorithms, such as PCM encoding/decoding, but are also required to perform other application-specific operations. Examples of special-purpose DSP processors are Cirrus's processor for digital audio sampling rate converters (CS8420), Mitel's multi-channel telephony voice echo canceller (MT9300), the FFT processor (PDSP16515A) and the programmable FIR filter (VPDSP16256).

Both general-purpose and special-purpose processors can be designed with single chips or with individual blocks of multipliers, ALUs, memories, and so on. First, we will discuss the architectural features of digital signal processors that have made real-time DSP in many areas possible.

Most general-purpose processors available today are based on the Von Neumann concept, where operations are performed sequentially. Figure 1 shows a simplified architecture for a standard Von Neumann processor. When an instruction is processed in such a processor, units of the processor not involved at each instruction phase wait idly until control is passed on to them.

Figure 1. A simplified architecture for a standard microprocessor: a single address bus and a single data bus are shared by the address generator, the program and data memory, the I/O devices, and the arithmetic unit (ALU, multiplier, product register and accumulator).


An increase in processor speed is achieved by making the individual units operate faster, but there is a limit on how fast they can be made to operate. If it is to operate in real time, a DSP processor must have its architecture optimized for executing DSP functions.

Figure 2. Basic generic hardware architecture for signal processing: separate X data, Y data and program memories on separate X, Y and P data buses, feeding an arithmetic unit containing an ALU, shifter, multiplier and accumulator, with an interface to I/O devices.

Figure 2 shows a generic hardware architecture suitable for real-time DSP. It is characterized by the following:

• Multiple bus structure with separate memory spaces for data and program instructions. Typically, the data memories hold input data, intermediate data values and output samples, as well as fixed coefficients for, for example, digital filters or FFTs. The program instructions are stored in the program memory.

• The I/O port provides a means of passing data to and from external devices such as the ADC and DAC, or for passing digital data to other processors. Direct memory access (DMA), if available, allows rapid transfer of blocks of data directly to or from data RAM, typically under external control.

• Arithmetic units for logical and arithmetic operations, which include an ALU, a hardware multiplier and shifters (or a multiplier-accumulator).

Why is such an architecture necessary? Most DSP algorithms (such as filtering, correlation and the fast Fourier transform) involve repetitive arithmetic operations (multiply, add, memory accesses) and heavy data flow through the CPU. The architecture of standard microprocessors is not suited to this type of activity. An important goal in DSP hardware design is to optimize both the hardware architecture and the instruction set for DSP operations. In digital signal processors, this is achieved by making extensive use of the concept of parallelism. In particular, the following techniques are used:

1. Harvard architecture;
2. pipelining;
3. fast, dedicated hardware multiplier/accumulator;
4. special instructions dedicated to DSP;
5. replication;
6. on-chip memory/cache;
7. extended parallelism (SIMD, VLIW and static superscalar processing).

For successful DSP design, it is important to understand these key architectural features.


Harvard architecture:

The principal feature of the Harvard architecture is that the program and data memories lie in two separate spaces, permitting a full overlap of instruction fetch and execution.

Standard microprocessors, such as the 6502, are characterized by a single bus structure for both data and instructions, as shown in Figure 1.

Suppose that in a standard microprocessor we wish to read a value op1 at address ADR1 in memory into the accumulator and then store it at two other addresses, ADR2 and ADR3. The instructions could be

LDA ADR1   load the operand op1 into the accumulator from ADR1
STA ADR2   store op1 in address ADR2
STA ADR3   store op1 in address ADR3

Typically, each of these instructions would involve three distinct steps:

    instruction fetch; instruction decode; instruction execute.

In our case, the instruction fetch involves fetching the next instruction from memory, and the instruction execute involves either reading or writing data into memory. In a standard processor, without Harvard architecture, the program instructions (that is, the program code) and the data (operands) are held in one memory space; see Figure 3. Thus the fetching of the next instruction while the current one is executing is not allowed, because the fetch and execution phases each require memory access.


Figure 3. An illustration of instruction fetch, decode and execute in a non-Harvard architecture with a single memory space: (a) instruction fetch from memory (the MPU's program counter and instruction register addressing a single memory holding the instructions LDA ADR1, STA ADR2, STA ADR3 and the operands at ADR1, ADR2, ADR3); (b) timing diagram showing the fetch, decode and execute phases of the three instructions proceeding strictly one after another.


In a Harvard architecture (Figure 4), since the program instructions and data lie in separate memory spaces, the fetching of the next instruction can overlap the execution of the current instruction; see Figure 5. Normally, the program memory holds the program code, while the data memory stores variables such as the input data samples.

Figure 4. Basic Harvard architecture with separate data and program memory spaces: the digital signal processor connects to the program memory over its own program memory address bus and program data bus, and to the data memory over a separate data memory address bus and data bus.

It may be seen from Figure 4 that data and program instruction fetches can be overlapped, as two independent memories are used in the architecture. This is explained with the help of the timing diagram shown in Figure 5 below.


Figure 5. An illustration of the instruction overlap made possible by the Harvard architecture: the fetch, decode and execute phases of LDA ADR1, STA ADR2 and STA ADR3 overlap, with a new instruction starting on each clock cycle.

Strict Harvard architecture is used by some digital signal processors (for example the Motorola DSP56000), but most use a modified Harvard architecture (for example, the TMS320 family of processors). In the modified architecture used by the TMS320, for example, separate program and data memory spaces are still maintained, but communication between the two memory spaces is permissible, unlike in the strict Harvard architecture.

Pipelining:

Pipelining is a technique which allows two or more operations to overlap during execution. In pipelining, a task is broken down into a number of distinct subtasks which are then overlapped during execution. It is used extensively in digital signal processors to increase speed. A pipeline is akin to a typical production line in a factory, such as a car or television assembly plant. As in the production line, the task is broken down into small, independent subtasks called pipe stages. The pipe stages are connected in series to form a pipe, and the stages are executed sequentially. As we have seen in the last example, an instruction can be broken down into three steps. Each step in the instruction can be regarded as a stage in a pipeline and so can be overlapped. By overlapping the instructions, a new instruction is started at the start of each clock cycle, as shown in Figure 6(a).

Figure 6(a). Three instructions overlapped in a three-stage pipeline: while instruction 1 is in pipe stage 3, instruction 2 is in pipe stage 2 and instruction 3 is in pipe stage 1, with a new instruction entering the pipe on each clock cycle.

Figure 6(b) gives the timing diagram for a three-stage pipeline, drawn to highlight the instruction steps. Typically, each step in the pipeline takes one machine cycle.

Figure 6(b). Timing diagram for a three-stage pipeline: during clock cycle i the processor fetches instruction i, decodes instruction i-1 and executes instruction i-2.

Thus during a given cycle up to three different instructions may be active at the same time, although each will be at a different stage of completion. The key to an instruction pipeline is that the three parts of the instruction (that is, fetch, decode and execute) are independent, and so the execution of multiple instructions can be overlapped. In Figure 6(b), it is seen that, at the ith cycle, the processor could be simultaneously fetching the ith instruction, decoding the (i-1)th instruction and at the same time executing the (i-2)th instruction.


The three-stage pipelining discussed above is based on the technique used in the Texas Instruments TMS320 processors. As in other applications of pipelining, in the TMS320 a number of registers are used to achieve the pipeline: a pre-fetch counter holds the address of the next instruction to be fetched, an instruction register holds the instruction to be executed, and a queue instruction register stores the instructions to be executed if the current instruction is still executing. The program counter contains the address of the next instruction to execute.

By exploiting the inherent parallelism in the instruction stream, pipelining leads to a significant reduction, on average, in the execution time per instruction. The throughput of a pipelined machine is determined by the number of instructions through the pipe per unit time. As in a production line, all the stages in the pipeline must be synchronized. The time for moving an instruction from one step to another within the pipe (see Figure 6(a)) is one cycle and depends on the slowest stage in the pipeline. In a perfect pipeline, the average time per instruction is given by

average time per instruction (pipelined) = time per instruction (non-pipelined) / number of pipe stages    (1)

In the ideal case, the speed increase is equal to the number of pipe stages. In practice, the speed increase will be less because of the overheads in setting up the pipeline, delays in the pipeline registers, and so on.

Example 1

In a non-pipelined machine, the instruction fetch, decode, and execute take 35 ns, 25 ns, and 40 ns, respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns pipeline overhead at each stage, and ignore other delays.

In the non-pipelined machine, the average instruction time is simply the sum of the execution times of all the steps: 35 + 25 + 40 ns = 100 ns. However, if we assume that the processor has a fixed machine cycle with the instruction steps synchronized to the system clock, then each instruction would take three machine cycles to complete:

40 ns x 3 = 120 ns (since the slowest cycle is 40 ns). This corresponds to a throughput of 8.3 x 10^6 instructions per second.

In the pipelined machine, the clock speed is determined by the speed of the slowest stage plus overheads. In our case, the machine cycle is 40 + 5 = 45 ns. This places a limit on the average instruction execution time. The throughput (when the pipeline is full) is 22.2 x 10^6 instructions per second. Then

speedup = average instruction time (non-pipelined) / average instruction time (pipelined)
        = 120/45 = 2.67 times (assuming the non-pipelined machine executes each instruction in three cycles)

In the pipelined machine, each instruction still takes three clock cycles, but at each cycle the processor is executing up to three different instructions. Pipelining increases the system throughput, but not the execution time of each instruction on its own. Typically, there is a slight increase in the execution time of each instruction because of the pipeline overhead.
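As a quick check of the arithmetic in Example 1, the following short C program (a minimal sketch, with the stage times and overhead from the example hard-coded) computes the non-pipelined and pipelined instruction times, the corresponding throughputs and the speedup:

    #include <stdio.h>

    int main(void)
    {
        /* Stage delays from Example 1 (ns) and the per-stage pipeline overhead */
        double stages[] = { 35.0, 25.0, 40.0 };   /* fetch, decode, execute */
        double overhead = 5.0;
        int n = 3;

        /* The slowest stage sets the machine cycle */
        double slowest = 0.0;
        for (int i = 0; i < n; i++)
            if (stages[i] > slowest) slowest = stages[i];

        double t_nonpipe = slowest * n;         /* 3 cycles of 40 ns = 120 ns */
        double t_pipe    = slowest + overhead;  /* 45 ns per instruction when the pipe is full */

        printf("non-pipelined: %.0f ns/instruction, %.1f MIPS\n", t_nonpipe, 1000.0 / t_nonpipe);
        printf("pipelined:     %.0f ns/instruction, %.1f MIPS\n", t_pipe, 1000.0 / t_pipe);
        printf("speedup:       %.2f\n", t_nonpipe / t_pipe);
        return 0;
    }

Running it reproduces the figures above: 120 ns (8.3 MIPS) without pipelining, 45 ns (22.2 MIPS) with it, and a speedup of 2.67.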

Pipelining has a major impact on the system memory. The number of memory accesses in a pipelined machine increases, essentially by the number of stages. In DSP, the use of the Harvard architecture, where data and instructions lie in separate memory spaces, promotes pipelining.

When a slow unit, such as a data memory, and an arithmetic element are connected in series, the arithmetic unit often waits idly for a good deal of the time for data. Pipelining may be used in such cases to allow better utilization of the arithmetic unit. The next example illustrates this concept.

Example 2

Most DSP algorithms are characterized by multiply-and-accumulate operations typified by the following equation:


a0 x(n) + a1 x(n-1) + a2 x(n-2) + ... + aN-1 x(n-(N-1))
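This is the standard FIR multiply-accumulate (MAC) sum. As a minimal, processor-independent sketch in C (the coefficient array a[] and the delay line x[], with x[0] holding x(n) and x[k] holding x(n-k), are assumptions of the sketch), the whole computation is one repeated MAC:

    /* FIR multiply-accumulate: sum of a[k] * x(n-k) for k = 0 .. N-1.
       a[]: filter coefficients; x[]: delay line, x[k] = x(n-k). */
    double fir_mac(const double a[], const double x[], int N)
    {
        double acc = 0.0;                /* accumulator */
        for (int k = 0; k < N; k++)
            acc += a[k] * x[k];          /* one multiply-accumulate per tap */
        return acc;
    }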

Figure 7 shows a non-pipelined configuration for an arithmetic element for executing the MAC equation above. Assume a transport delay of 200 ns, 100 ns, and 100 ns, respectively, for the memory, multiplier and accumulator.

Figure 7. Non-pipelined MAC configuration: the coefficient memory (a0 ... aN-1) and data memory (x(n) ... x(n-(N-1))) feed the multiplier, whose products are clocked into the accumulator every 400 ns.

    1. What is the system throughput?

2. Reconfigure the system with pipelining to give a speed increase of 2:1. Illustrate the operation of the new configuration with a timing diagram.

    Solution:

1. The coefficient and data arrays are stored in memory as shown in Figure 7. In the non-pipelined mode, the coefficients and data are accessed sequentially and applied to the multiplier. The products are summed in the accumulator. Successive multiply-accumulate (MAC) operations are performed once every 400 ns (200 + 100 + 100), giving a throughput of 2.5 x 10^6 operations per second.

2. The arithmetic operations involved can be broken up into three distinct steps: memory read, multiply, and accumulate. To improve speed, these steps can be overlapped. A speed improvement of 2:1 can be achieved by inserting pipeline registers between the memory and the multiplier and between the multiplier and the accumulator, as shown in Figure 8.


Figure 8. Pipelined MAC configuration: pipeline registers between the coefficient/data memories and the multiplier serve as temporary stores for each coefficient and data sample pair, and the product register serves as a temporary store for the product before accumulation.


The timing diagram for the pipelined configuration is shown in Figure 9. As is evident in the timing diagram, the MAC is performed once every 200 ns. The limiting factor is the basic transport delay through the slowest element, in this case the memory. Pipeline overheads have been ignored.

Figure 9. Timing diagram for a pipelined MAC unit: the read, multiply and accumulate stages of successive MACs overlap, so that when the pipeline is full a MAC operation is completed every clock cycle (200 ns).

DSP algorithms are often repetitive but highly structured, making them well suited to multilevel pipelining. For example, the FFT requires the continuous calculation of butterflies. Although each butterfly requires different data and coefficients, the basic butterfly arithmetic operations are identical. Thus arithmetic units such as FFT processors can be tailored to take advantage of this. Pipelining ensures a steady flow of instructions to the CPU, and in general leads to a significant increase in system throughput.


However, on occasion pipelining may cause problems. For example, in some digital signal processors, pipelining may cause an unwanted instruction to be executed, especially near branch instructions, and the designer should be aware of this possibility.

Hardware multiplier-accumulator:

The basic numerical operations in DSP are multiplications and additions. Multiplication, in software, is notoriously time consuming. Additions are even more time consuming if floating-point arithmetic is used. To make real-time DSP possible, a fast, dedicated hardware multiplier-accumulator (MAC) using fixed- or floating-point arithmetic is mandatory. A fixed- or floating-point hardware MAC is now standard in all digital signal processors. In a fixed-point processor, the hardware multiplier typically accepts two 16-bit 2's complement fractional numbers and computes a 32-bit product in a single cycle (typically 25 ns). The average MAC instruction time can be significantly reduced through the use of special repeat instructions.
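To illustrate the fixed-point case in software terms (a minimal sketch, not the instruction set of any particular processor): a 16-bit by 16-bit multiply of 2's complement fractional values (often called Q15) produces a 32-bit product, which is then summed into a wider accumulator so that intermediate results do not overflow:

    #include <stdint.h>

    /* Multiply-accumulate of 16-bit Q15 (2's complement fractional) samples.
       Each 16 x 16 multiply yields a 32-bit product; a 64-bit variable stands
       in here for the extended-precision accumulator of a hardware MAC. */
    int64_t mac_q15(const int16_t a[], const int16_t x[], int N)
    {
        int64_t acc = 0;
        for (int k = 0; k < N; k++) {
            int32_t product = (int32_t)a[k] * (int32_t)x[k];  /* 32-bit product */
            acc += product;                                   /* accumulate     */
        }
        return acc;   /* Q30-scaled sum of N products */
    }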


A typical DSP hardware MAC configuration is depicted in Figure 10. In this configuration the multiplier has a pair of input registers that hold the inputs to the multiplier, and a 32-bit product register which holds the result of a multiplication. The output of the P (product) register is connected to a double-precision accumulator, where the products are accumulated.

Figure 10. A typical MAC configuration in DSPs: 16-bit X data and Y data are latched into the X and Y input registers, multiplied into a 32-bit P (product) register, and accumulated into the 32-bit R register.


The principle is very much the same for hardware floating-point multiplier-accumulators, except that the inputs and products are normalized floating-point numbers. Floating-point MACs allow fast computation of DSP results with minimal errors. DSP algorithms such as FIR and IIR filtering suffer from the effects of finite word length (coefficient quantization and arithmetic errors). Floating point offers a wide dynamic range and reduced arithmetic errors, although for many applications the dynamic range provided by the fixed-point representation is adequate.

    General-purpose digital signal processors:

General-purpose digital signal processors are basically high-speed microprocessors with hardware architectures and instruction sets optimized for DSP operations. These processors make extensive use of parallelism, Harvard architecture, pipelining and dedicated hardware whenever possible to perform time-consuming operations such as shifting/scaling, multiplication, and so on.

General-purpose DSPs have evolved substantially over the last decade as a result of the never-ending quest to find better ways to perform DSP operations, in terms of computational efficiency, ease of implementation, cost, power consumption, size, and application-specific needs. The insatiable appetite for improved computational efficiency has led to substantial reductions in instruction cycle times and, more importantly, to increasing sophistication in the hardware and software architectures. It is now common to have dedicated, on-chip arithmetic hardware units (e.g. to support fast multiply-accumulate operations), large on-chip memory with multiple access, and special instructions for efficient execution of inner core computations in DSP. There is also a trend towards increased data word sizes (e.g. to maintain signal quality) and increased parallelism (to increase both the number of instructions executed in one cycle and the number of operations performed per instruction). Thus, in newer general-purpose DSP processors increasing use is made of multiple data paths and arithmetic units to support parallel operations. DSP processors based on SIMD (Single Instruction, Multiple Data), VLIW (Very Long Instruction Word) and superscalar architectures are being introduced to support efficient parallel processing. In some DSPs, performance is enhanced further by using specialized, on-chip co-processors to speed up specific DSP algorithms such as FIR filtering and Viterbi decoding. The explosive growth in communications and digital audio technologies has had a major influence on the evolution of DSPs, as has growth in embedded DSP processor applications.

Fixed-point digital signal processors:

Fixed-point DSP processors available today differ in their detailed architecture and the on-board resources provided. A summary of key architectures of four generations of fixed-point DSP processors from four leading semiconductor manufacturers is given in Table 1. The classification of DSP processors into the four generations is partly based on historical reasons, architectural features, and computational performance.

The basic architecture of the first generation fixed-point DSP processor family (TMS320C1x), first introduced in 1982 by Texas Instruments, is depicted in Figure 11.


Figure 11. A simplified architecture of a first generation fixed-point DSP processor (Texas Instruments TMS320C10): separate program and data memories, a 16 x 16-bit multiplier with input registers, a 32-bit ALU and a 32-bit accumulator, connected by 16-bit program memory and data buses.

Key features of the TMS320C1x are the dedicated arithmetic units, which include a multiplier and an accumulator. The processor family has a modified Harvard architecture with two separate memory spaces for programs and data. It has on-chip memory and special instructions for execution of basic DSP algorithms, although these are limited.

Second generation fixed-point DSPs have substantially enhanced features as compared with the first generation. In most cases, these include much larger on-chip memories and more special instructions to support efficient execution of DSP algorithms. As a result, the computational performance of second generation DSP processors is four to six times that of the first generation.

Typical second generation DSP processors include the Texas Instruments TMS320C5x, Motorola DSP5600x, Analog Devices ADSP21xx and Lucent Technologies DSP16xx families. Texas Instruments first and second generation DSPs have a lot in common architecturally, but the second generation DSPs have more features and increased speed. The internal architecture that typifies the TMS320C5x family of processors is shown in Figure 12 in a simplified form to emphasize the dual internal memory spaces which are characteristic of the Harvard architecture.


The Motorola DSP5600x processor is a high-precision fixed-point digital signal processor. Its architecture is depicted in Figure 13.

Figure 13. A simplified architecture of a second generation fixed-point DSP (Motorola DSP56002): program memory (ROM/RAM), X data and Y data memories on separate 24-bit X, Y and global data buses, a 24 x 24/56-bit MAC with two 56-bit accumulators, and a data bus switch to the 24-bit external data bus.

Internally, it has two independent data memory spaces, the X data and Y data memory spaces, and one program memory space. Having two separate data memory spaces allows a natural partitioning of data for DSP operations and facilitates the execution of the algorithm. For example, in graphics applications data can be stored as X and Y data, in FIR filtering as coefficients and data, and in the FFT as real and imaginary parts. During program execution, pairs of data samples can be fetched or stored in internal memory simultaneously in one cycle. Externally, the two data spaces are multiplexed into a single data bus, reducing somewhat the benefits of the dual internal data memory. The arithmetic units consist of two 56-bit accumulators and a single-cycle, fixed-point hardware multiplier-accumulator (MAC). The MAC accepts 24-bit inputs and produces a 56-bit product. The 24-bit word length provides sufficient accuracy for representing most DSP variables, while the 56-bit accumulator (including eight guard bits) prevents arithmetic overflows. These word lengths are adequate for most applications, including digital audio, which imposes stringent requirements. The DSP5600x processors provide special instructions that allow zero-overhead looping and a bit-reversed addressing capability for scrambling input data before the FFT, or unscrambling the fast Fourier transformed data.
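Bit-reversed addressing is what a radix-2 FFT needs in order to reorder its input (or output) samples. A plain-C sketch of the reordering that such an addressing mode performs for free in hardware (the array length is assumed to be a power of two) is:

    /* Reverse the lowest 'bits' bits of the index i (e.g. for 3 bits, 001 -> 100). */
    static unsigned bit_reverse(unsigned i, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned b = 0; b < bits; b++) {
            r = (r << 1) | (i & 1u);
            i >>= 1;
        }
        return r;
    }

    /* Scramble an array of N = 2^bits samples into bit-reversed order,
       as a DSP's bit-reversed addressing mode would do with zero overhead. */
    void bit_reverse_reorder(float x[], unsigned bits)
    {
        unsigned N = 1u << bits;
        for (unsigned i = 0; i < N; i++) {
            unsigned j = bit_reverse(i, bits);
            if (j > i) {                 /* swap each pair only once */
                float t = x[i];
                x[i] = x[j];
                x[j] = t;
            }
        }
    }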

The Analog Devices ADSP21xx is another family of second generation fixed-point DSP processors, with two separate external memory spaces: one holds data only, and the other holds program code as well as data. A simplified block diagram of the internal architecture of the ADSP21xx is depicted in Figure 14.


Figure 14. A simplified architecture of a second generation fixed-point DSP (Analog Devices ADSP2100): program and data memory units on a 24-bit program memory path and a 16-bit data memory path, feeding arithmetic units comprising an ALU, MAC and shifter.

The main components are the ALU, multiplier-accumulator, and shifters. The MAC accepts 16 x 16-bit inputs and produces a 32-bit product in one cycle. The accumulator of the ADSP21xx has eight guard bits which may be used for extended precision. The ADSP21xx departs from the strict Harvard architecture, as it allows the storage of both data and program instructions in the program memory. A signal line (the data access signal) is used to indicate when data, and not program instructions, are being fetched from the program memory. Storage of data in the program memory inhibits a steady data flow through the CPU, as data and instruction fetches cannot occur simultaneously. To avoid a bottleneck, the ADSP21xx family has an on-chip program memory cache which holds the last 16 instructions executed. This eliminates the need, especially when executing program loops, for repeated instruction fetches from program memory. The ADSP21xx provides special instructions for zero-overhead looping and supports a bit-reversed addressing facility for the FFT. The processor family has a large on-chip memory (up to 64 Kbytes of internal RAM is provided for increased data transfer). The processor has excellent support for DMA: external devices can transfer data and instructions to or from the DSP processor RAM without processor intervention.

The Lucent Technologies DSP16xx family of fixed-point DSPs (see Figure 15) is targeted at the telecommunications and modem market.

Figure 15. A simplified architecture of the Lucent Technologies DSP16xx fixed-point DSP: program memory, data memory and an instruction cache on 16-bit X and Y data buses, with a 16 x 16-bit multiplier, an ALU and two 36-bit accumulators.


In terms of computational performance, it is one of the most powerful second generation processors. The processor has a Harvard architecture and, like most of the other second generation processors, it has two data paths, the X and Y data paths. Its data arithmetic units include a dedicated 16 x 16-bit multiplier, a 36-bit ALU/shifter (which includes four guard bits) and dual accumulators. Special instructions, such as those for zero-overhead single and block instruction looping, are provided.

Third generation fixed-point DSPs are essentially enhancements of second generation DSPs. In general, performance enhancements are achieved by increasing and/or making more effective use of available on-chip resources. Compared with the second generation DSPs, features of the third generation DSPs include more data paths (typically three compared with two in the second generation), wider data paths, larger on-chip memory and instruction cache, and in some cases a dual MAC. As a result, the performance of third generation DSPs is typically two or three times superior to that of the second generation DSP processors of the same family. Simplified architectures of three third generation DSP processors, the TMS320C54x, DSP563x and DSP16000, are depicted in Figures 16, 17 and 18.


Figure 16. A simplified architecture of a third generation fixed-point DSP (Texas Instruments TMS320C54x): 16 K words of program ROM plus 8 K and 24 K words of program/data RAM on multiple data buses (program, C and D data buses), with a MAC unit (17 x 17-bit multiplier, 40-bit adder, round/scale logic) and an ALU unit (40-bit ALU, Viterbi accelerator, two 40-bit accumulators, 40-bit shifter).


Figure 17. A simplified architecture of a third generation fixed-point DSP (Motorola DSP56300): a 4 K-word program cache and 2 K-word X data and Y data RAMs on separate program, X and Y data buses, with a data ALU containing a 24 x 24-bit MAC, two 56-bit accumulators and a shifter.


Figure 18. A simplified architecture of a third generation fixed-point DSP (Lucent Technologies DSP16000): program and data memories on 32-bit X and Y data buses, and an arithmetic unit with two 16 x 16 MACs, an ALU/adder and eight 40-bit accumulators.


Most of the third generation fixed-point DSP processors are aimed at applications in digital communication and digital audio, reflecting the enormous growth and influence of these application areas on DSP processor development. Thus there are features in some of the processors that support these applications. In the third generation processors, semiconductor manufacturers have also taken the issue of power consumption seriously because of the processors' use in portable and handheld devices.

Fourth generation fixed-point processors, with their new architectures, are primarily aimed at large and/or emerging multichannel applications, such as the digital subscriber loop, remote access server modems, wireless base stations, third generation mobile systems and medical imaging. The new fixed-point architecture that has attracted a great deal of attention in the DSP community is the very long instruction word (VLIW) architecture. The new architecture makes extensive use of parallelism whilst retaining some of the good features of previous DSP processors. Compared with previous generations, fourth generation fixed-point DSP processors, in general, have wider instruction words, wider data paths, more registers, larger instruction caches and multiple arithmetic units, enabling them to execute many more instructions and operations per cycle.

The Texas Instruments TMS320C62x family of fixed-point DSP processors is based on the VLIW architecture, as shown in Figure 19.


Figure 19. A simplified architecture of a fourth generation fixed-point, very long instruction word DSP processor (Texas Instruments TMS320C62x): on-chip program and data RAM, a 256-bit program data bus, two 32-bit data buses (A and B), an instruction fetch/dispatch/decode unit and two independent arithmetic data paths, each with its own register file and four execution units (L1, S1, M1 and D1; L2, S2, M2 and D2).

The core processor has two independent arithmetic paths, each with four execution units: a logic unit (Li), a shifter/logic unit (Si), a multiplier (Mi) and a data address unit (Di). Typically, the core processor fetches eight 32-bit instructions at a time, giving an instruction width of 256 bits (and hence the term very long instruction word). With a total of eight execution units, four in each data path, the TMS320C62x can execute up to eight instructions in parallel in one cycle. The processor has large program and data cache memories (typically, 4 Kbytes of level 1 program/data caches and 64 Kbytes of level 2 program/data cache). Each data path has its own register file (sixteen 32-bit registers), but can also access registers on the other data path. Advantages of VLIW architectures include simplicity and high computational performance. Disadvantages include increased program memory usage (organization of code to match the inherent parallelism of the processor may lead to inefficient use of memory). Further, optimum processor performance can only be achieved when all the execution units are busy, which is not always possible because of data dependencies, instruction delays and restrictions in the use of the execution units. However, sophisticated programming tools are available for code packing, instruction scheduling, resource assignment, and in general to exploit the vast potential of the processor.
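To give a flavour of what organizing code to match the processor's parallelism means, the following C sketch (purely illustrative, not TMS320C62x code or intrinsics) unrolls an FIR loop into two independent partial sums; a VLIW compiler or hand scheduler can then map the independent multiplies and adds onto the two data paths:

    /* FIR sum split into two independent partial sums so that a VLIW
       scheduler can issue the work to two parallel data paths.
       The number of taps N is assumed to be even. */
    float fir_unrolled(const float a[], const float x[], int N)
    {
        float acc0 = 0.0f;   /* partial sum intended for data path 1 */
        float acc1 = 0.0f;   /* partial sum intended for data path 2 */

        for (int k = 0; k < N; k += 2) {
            acc0 += a[k]     * x[k];       /* even taps: independent of acc1 */
            acc1 += a[k + 1] * x[k + 1];   /* odd taps:  independent of acc0 */
        }
        return acc0 + acc1;   /* combine the two partial sums */
    }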

    Floating-point digital signal processors:

The ability of DSP processors to perform high-speed, high-precision DSP operations using floating-point arithmetic has been a welcome development. This minimizes the finite word-length effects such as overflows, round-off errors, and coefficient quantization errors inherent in DSP. It also facilitates algorithm development, as a designer can develop an algorithm on a large computer in a high-level language and then port it to a DSP device more readily than with a fixed-point device.

Floating-point DSP processors retain key features of fixed-point processors, such as special instructions for DSP operations and multiple data paths for multiple operations. As in the case of fixed-point DSP processors, the floating-point DSP processors available are significantly different architecturally.

The TMS320C3x is perhaps the best known family of first generation general-purpose floating-point DSPs. The C3x family are 32-bit single-chip digital signal processors and support both integer and floating-point arithmetic operations. They have a large memory space and are equipped with many on-chip peripheral facilities to simplify system design. These include a program cache to improve the execution of commonly used code, and on-chip dual-access memories. The large memory spaces cater for memory-intensive applications, for example graphics and image processing. In the TMS320C30, a floating-point multiplication requires 32-bit operands and produces a 40-bit normalized floating-point product. Integer multiplication requires 24-bit inputs and yields 32-bit results. Three floating-point formats are supported. The first is a 16-bit short floating-point format, with a 4-bit exponent, 1 sign bit and 11 bits for the mantissa. This format is for immediate floating-point operations. The second is a single-precision format with an 8-bit exponent, 1 sign bit and a 23-bit fraction (32 bits in total). The third is a 40-bit extended-precision format which has an 8-bit exponent, 1 sign bit and a 31-bit fraction. The floating-point representation differs from that of the IEEE standard, but facilities are provided to allow conversion between the two formats. The TMS320C3x combines the features of the Harvard architecture (separate buses for program instructions, data and I/O) and the Von Neumann processor (unified address space).
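For reference, the IEEE single-precision format that the C3x native representation must be converted to and from packs 1 sign bit, an 8-bit biased exponent and a 23-bit fraction into 32 bits. A small, generic C sketch (not the TMS320C3x conversion routine) that unpacks these fields is:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = -1.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32-bit pattern */

        uint32_t sign     = bits >> 31;           /* 1 bit                 */
        uint32_t exponent = (bits >> 23) & 0xFFu; /* 8 bits, biased by 127 */
        uint32_t fraction = bits & 0x7FFFFFu;     /* 23-bit fraction       */

        printf("sign=%u  exponent=%u (unbiased %d)  fraction=0x%06X\n",
               sign, exponent, (int)exponent - 127, fraction);
        return 0;
    }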

The emphasis in the second generation, general-purpose floating-point DSPs is on multiprocessing and multiprocessor support. Key issues in multiprocessor support include inter-processor communication, DMA transfers and global memory sharing. The best known second generation floating-point DSP families are the Texas Instruments TMS320C4x and Analog Devices ADSP-2106x SHARC (Super Harvard Architecture Computer). The C4x shares some of the architectural features of the C3x, but it was designed for multiprocessing. The C4x family has good I/O capabilities: it has six COMM ports for inter-processor communication and six 32-bit wide DMA channels for rapid data transfers. The architecture allows multiple operations to be performed in parallel in one instruction cycle. The C4x family supports both floating- and fixed-point arithmetic. The native floating-point data format in the C40 differs from the IEEE 754/854 standard, although conversion between them can be readily accomplished.

Analog Devices ADSP-2106x SHARC DSP processors are also 32-bit floating-point devices. They have large internal memory and impressive I/O capability: 10 DMA channels to allow access to internal memory without processor intervention, and six link ports for inter-processor communications at high speed. The architecture allows shared global memory, making it possible for up to six SHARC processors to access each other's internal RAM at up to the full data rate. The ADSP-2106x family supports both fixed-point and floating-point arithmetic. Its single-precision floating-point format complies with the single-precision IEEE 754/854 floating-point standard (24-bit mantissa and 8-bit exponent). The architecture also supports multiple operations per cycle.

Third generation floating-point DSP processors take the concepts of parallelism much farther to increase both the number of instructions and the number of operations in a cycle, to meet the challenges of multichannel and computationally intensive applications. This is achieved by the use of new architectures, the VLIW (very long instruction word) and superscalar architectures in particular. The two leading third generation floating-point DSP processor families are the Texas Instruments TMS320C67x and Analog Devices ADSP-TS001. The TMS320C67x family has the same VLIW architecture as the advanced, fourth generation fixed-point DSP processors, the TMS320C62x.

The Tiger SHARC DSP family supports mixed arithmetic types (fixed- and floating-point arithmetic) and data types (8-, 16-, and 32-bit numbers). This flexibility makes it possible to use the arithmetic and data type most appropriate for a given application to enhance performance. As with the TMS320C67x, the Tiger SHARC is aimed at large-scale, multi-channel applications, such as third generation mobile systems (3G wireless), digital subscriber lines (xDSL) and remote, multiple access server modems for Internet services. Tiger SHARC, with its static superscalar architecture, combines the good features of the VLIW architecture, conventional DSP architecture, and RISC computers. The processor has two computation blocks, each with a multiplier, ALU and 64-bit shifter. The processor can execute up to eight MAC operations per cycle with 16-bit inputs and 40-bit accumulation, two 40-bit MACs on 16-bit complex data, or two 80-bit MACs with 32-bit data. With 8-bit data, Tiger SHARC can issue up to 16 operations in a cycle. Tiger SHARC has a wide memory bandwidth, with its memory organized in three 128-bit wide banks. Access to data can be in variable data sizes: normal 32-bit words, long 64-bit words or quad 128-bit words. Up to four 32-bit instructions can be issued in one cycle. To avoid the use of large NOPs (which is a disadvantage of VLIW designs), the large instruction words may be broken down into separate short instructions which are issued to each unit independently.
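Treating one wide word as several narrower values is the SIMD idea behind these multiple MACs per cycle. A plain-C illustration (generic, not TigerSHARC intrinsics) that packs two 16-bit samples into one 32-bit word and adds two such packed words lane by lane is:

    #include <stdint.h>

    /* Pack two unsigned 16-bit samples into one 32-bit word. */
    uint32_t pack16x2(uint16_t lo, uint16_t hi)
    {
        return ((uint32_t)hi << 16) | lo;
    }

    /* Add two pairs of packed 16-bit samples in one pass, with wrap-around
       kept inside each 16-bit lane (SIMD within a register). */
    uint32_t add16x2(uint32_t a, uint32_t b)
    {
        uint32_t lo = (a + b) & 0x0000FFFFu;          /* low lane          */
        uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high lane, wraps  */
        return hi | lo;
    }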


    Selecting digital signal processors:

The choice of a DSP processor for a given application has become an important issue in recent years because of the wide range of processors available (Levy, 1999; Berkeley Design Technology, 1996, 1999). Specific factors that may be considered when selecting a DSP processor for an application include architectural features, execution speed, type of arithmetic and word length.

    1. Architectural features

Most DSP processors available today have good architectural features, but these may not be adequate for a specific application. Key features of interest include the size of on-chip memory, special instructions and I/O capability. On-chip memory is an essential requirement in most real-time DSP applications for fast access to data and rapid program execution. For memory-hungry applications (e.g. digital audio, FAX/modem, MPEG coding/decoding), the size of internal RAM may become an important distinguishing factor. Where internal memory is insufficient, it can be augmented by high-speed, off-chip memory, although this may add to system costs. For applications that require fast and efficient communication or data flow with the outside world, I/O features such as interfaces to ADCs and DACs, DMA capability and support for multiprocessing may be important. Depending on the application, a rich set of special instructions to support DSP operations is important, e.g. zero-overhead looping capability, dedicated DSP instructions, and circular addressing.
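Circular addressing is what keeps an FIR delay line up to date without physically shifting the samples. A minimal C sketch of the idea (the hardware performs the modulo wrap automatically; here it is written out, with a buffer length of 64 assumed for the example):

    /* Circular delay line: the newest sample overwrites the oldest, and the
       'head' index wraps around instead of the data being shifted. */
    #define DELAY_LEN 64

    typedef struct {
        float buf[DELAY_LEN];
        int   head;            /* index of the most recent sample */
    } delay_line;

    void put_sample(delay_line *d, float sample)
    {
        d->head = (d->head + 1) % DELAY_LEN;   /* modulo wrap = circular addressing */
        d->buf[d->head] = sample;
    }

    /* Fetch x(n-k): the sample k places behind the newest one. */
    float get_delayed(const delay_line *d, int k)
    {
        return d->buf[(d->head - k + DELAY_LEN) % DELAY_LEN];
    }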


    2. Execution speed

The speed of DSP processors is an important measure of performance because of the time-critical nature of most DSP tasks. Traditionally, the two main units of measurement for this are the clock speed of the processor, in MHz, and the number of instructions performed, in millions of instructions per second (MIPS) or, in the case of floating-point DSP processors, in millions of floating-point operations per second (MFLOPS). However, such measures may be inappropriate in some cases because of significant differences in the way different DSP processors operate, with most able to perform multiple operations in one machine instruction. For example, the C62x family of processors can execute as many as eight instructions in a cycle. The number of operations performed in each cycle also differs from processor to processor. Thus, comparison of the execution speed of processors based on such measures may not be meaningful. An alternative measure is based on the execution speed of benchmark algorithms, e.g. DSP kernels such as the FFT, FIR and IIR filters (Levy, 1998; Berkeley Design Technology, 1999).
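As a rough illustration of benchmarking a DSP kernel (a host-side sketch using standard C timing, not a vendor benchmark suite), one can time many repetitions of an FIR kernel and report the average time per output sample:

    #include <stdio.h>
    #include <time.h>

    #define TAPS 64
    #define RUNS 100000

    /* The kernel being benchmarked: one FIR output sample. */
    static float fir(const float a[], const float x[], int n)
    {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += a[k] * x[k];
        return acc;
    }

    int main(void)
    {
        static float a[TAPS], x[TAPS];
        volatile float sink = 0.0f;             /* keep the result live */

        for (int k = 0; k < TAPS; k++) { a[k] = 0.01f * k; x[k] = 1.0f; }

        clock_t t0 = clock();
        for (int r = 0; r < RUNS; r++)
            sink += fir(a, x, TAPS);
        clock_t t1 = clock();

        double total_s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        printf("average time per %d-tap FIR output: %.1f ns\n",
               TAPS, 1e9 * total_s / RUNS);
        return 0;
    }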

    3. Type of arithmetic

The two most common types of arithmetic used in modern DSP processors are fixed- and floating-point arithmetic. Floating-point arithmetic is the natural choice for applications with wide and variable dynamic range requirements (dynamic range may be defined as the difference between the largest and smallest signal levels that can be represented, or the difference between the largest signal and the noise floor, measured in decibels). Fixed-point processors are favored in low-cost, high-volume applications (e.g. cellular phones and computer disk drives). The use of fixed-point arithmetic raises issues associated with dynamic range constraints which the designer must address. In general, floating-point processors are more expensive than fixed-point processors, although the cost difference has fallen significantly in recent years. Most floating-point DSP processors available today also support fixed-point arithmetic.
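Using the definition of dynamic range above, an N-bit fixed-point word gives roughly 6 dB per bit, while a floating-point format gets most of its range from the exponent. A small C sketch of the comparison (illustrative figures only, assuming a 16-bit fixed-point word and an IEEE-style 8-bit exponent):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int fixed_bits = 16;   /* e.g. a 16-bit fixed-point data word       */
        int exp_bits   = 8;    /* e.g. the 8-bit exponent of an IEEE float  */

        /* Fixed point: largest/smallest magnitude ratio is about 2^(N-1),
           i.e. roughly 6.02 dB per bit. */
        double fixed_dr_db = 20.0 * log10(pow(2.0, fixed_bits - 1));

        /* Floating point: normalized IEEE-style values span roughly
           2^-(2^(e-1) - 2) to 2^(2^(e-1)), a ratio of about 2^(2^e - 2). */
        double float_dr_db = 20.0 * log10(pow(2.0, (1 << exp_bits) - 2));

        printf("16-bit fixed point: ~%.0f dB dynamic range\n", fixed_dr_db);
        printf("32-bit float      : ~%.0f dB dynamic range\n", float_dr_db);
        return 0;
    }

This prints roughly 90 dB for the 16-bit fixed-point word and about 1500 dB for the 32-bit floating-point format, which is why floating point suits wide and variable dynamic range requirements.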

    4. Word length

Processor data word length is an important parameter in DSP as it can have a significant impact on signal quality: it determines how accurately parameters and results of DSP operations can be represented. In general, the longer the data word, the lower the errors that are introduced by digital signal processing. In fixed-point audio processing, for example, a processor word length of at least 24 bits is required to keep the smallest signal level sufficiently above the noise floor generated by signal processing to maintain CD quality. A variety of processor word lengths are used in fixed-point DSP processors, depending on the application. Fixed-point DSP processors aimed at telecommunications markets tend to use a 16-bit word length (e.g. TMS320C54x), whereas those aimed at high-quality audio applications tend to use 24 bits (e.g. DSP56300). In recent years there has been a trend towards the use of more bits for the ADC and DAC (e.g. the Cirrus 24-bit audio codec, CS4228) as the cost of these devices falls to meet the insatiable demand for increased quality. Thus, there is likely to be an increased demand for larger processor word lengths for audio processing. In fixed-point processors, it may also be necessary to provide guard bits (typically 1 to 8 bits) in the accumulators to prevent arithmetic overflows during extended multiply-and-accumulate operations. The extra bits effectively extend the dynamic range available in the DSP processor. In most floating-point DSP processors, a 32-bit data size (24-bit mantissa and 8-bit exponent) is used for single-precision arithmetic. This size is also compatible with the IEEE floating-point format (IEEE 754). Most floating-point DSP processors also have fixed-point arithmetic capability, and often support variable data size, fixed-point arithmetic.
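The value of guard bits can be seen with a short calculation: each guard bit doubles the number of full-scale products that can be accumulated before overflow. A small C sketch (assuming, as is typical, that a 24 x 24-bit multiply yields a 48-bit product held in a 56-bit accumulator, leaving 8 guard bits):

    #include <stdio.h>

    int main(void)
    {
        /* A 24 x 24-bit multiply produces up to a 48-bit product; a 56-bit
           accumulator therefore leaves 56 - 48 = 8 guard bits of headroom. */
        int product_bits = 48;
        int acc_bits     = 56;
        int guard_bits   = acc_bits - product_bits;

        /* Up to 2^guard_bits full-scale products can be summed before the
           accumulator can overflow. */
        long max_terms = 1L << guard_bits;

        printf("%d guard bits -> up to %ld full-scale MACs without overflow\n",
               guard_bits, max_terms);
        return 0;
    }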


    TMS320C6416 DSP Board



    Functional block and DSP core diagram for TMS320C6416 DSP