DSP Hardware

EKT353 Lecture Notes by Professor Dr. Farid Ghani

    Introduction:

Since their introduction in the early 1980s, DSP processors have grown substantially in complexity and sophistication to enhance their capability and range of applicability. This has also led to a substantial increase in the number of DSP processors available. To reflect this, features of successive generations of fixed- and floating-point DSP processors and factors that affect the choice of DSP processors are considered in the following pages.

For convenience, DSP processors can be divided into two broad categories: general purpose and special purpose. General-purpose DSP processors include fixed-point devices such as the Texas Instruments TMS320C54x and Motorola DSP563x processors, and floating-point processors such as the Texas Instruments TMS320C4x and Analog Devices ADSP21xxx SHARC processors.

There are two types of special-purpose hardware:

1. Hardware designed for efficient execution of specific DSP algorithms, such as digital filters or the fast Fourier transform. This type of special-purpose hardware is sometimes called an algorithm-specific digital signal processor.

2. Hardware designed for specific applications: for example, telecommunications, digital audio, or control applications. This type of hardware is sometimes called an application-specific digital signal processor.

In most cases application-specific digital signal processors execute specific algorithms, such as PCM encoding/decoding, but are also required to perform other application-specific operations. Examples of special-purpose DSP processors are Cirrus's processor for digital audio sampling rate converters (CS8420), Mitel's multi-channel telephony voice echo canceller (MT9300), the FFT processor (PDSP16515A) and the programmable FIR filter (VPDSP16256).

Both general-purpose and special-purpose processors can be designed with single chips or with individual blocks of multipliers, ALUs, memories, and so on. First, we will discuss the architectural features of digital signal processors that have made real-time DSP in many areas possible.

Most general-purpose processors available today are based on the Von Neumann concept, where operations are performed sequentially. Figure 1 shows a simplified architecture for a standard Von Neumann processor. When an instruction is processed in such a processor, units of the processor not involved at each instruction phase wait idly until control is passed on to them.

Figure 1. A simplified architecture for a standard microprocessor: a single address bus and a single data bus are shared by the address generator, the program and data memory, the I/O devices, and the arithmetic unit (ALU, multiplier, product register and accumulator).


An increase in processor speed is achieved by making the individual units operate faster, but there is a limit on how fast they can be made to operate. If it is to operate in real time, a DSP processor must have its architecture optimized for executing DSP functions.

Figure 2. Basic generic hardware architecture for signal processing: separate X data, Y data and program memories on separate X, Y and P data buses, feeding an arithmetic unit containing an ALU, shifter, multiplier and accumulator, with an interface to I/O devices.

Figure 2 shows a generic hardware architecture suitable for real-time DSP. It is characterized by the following:

• Multiple bus structure with separate memory spaces for data and program instructions. Typically, the data memories hold input data, intermediate data values and output samples, as well as fixed coefficients for, for example, digital filters or FFTs. The program instructions are stored in the program memory.

• The I/O port provides a means of passing data to and from external devices such as the ADC and DAC, or for passing digital data to other processors. Direct memory access (DMA), if available, allows rapid transfer of blocks of data directly to or from data RAM, typically under external control.

• Arithmetic units for logical and arithmetic operations, which include an ALU, a hardware multiplier and shifters (or a multiplier-accumulator).

Why is such an architecture necessary? Most DSP algorithms (such as filtering, correlation and the fast Fourier transform) involve repetitive arithmetic operations (multiply, add, memory accesses) and heavy data flow through the CPU. The architecture of standard microprocessors is not suited to this type of activity. An important goal in DSP hardware design is to optimize both the hardware architecture and the instruction set for DSP operations. In digital signal processors, this is achieved by making extensive use of the concept of parallelism. In particular, the following techniques are used:

1. Harvard architecture;
2. pipelining;
3. fast, dedicated hardware multiplier/accumulator;
4. special instructions dedicated to DSP;
5. replication;
6. on-chip memory/cache;
7. extended parallelism (SIMD, VLIW and static superscalar processing).

For successful DSP design, it is important to understand these key architectural features.


Harvard architecture:

The principal feature of the Harvard architecture is that the program and data memories lie in two separate spaces, permitting a full overlap of instruction fetch and execution.

Standard microprocessors, such as the 6502, are characterized by a single bus structure for both data and instructions, as shown in Figure 1.

Suppose that in a standard microprocessor we wish to read a value op1 at address ADR1 in memory into the accumulator and then store it at two other addresses, ADR2 and ADR3. The instructions could be

LDA ADR1   load the operand op1 into the accumulator from ADR1
STA ADR2   store op1 in address ADR2
STA ADR3   store op1 in address ADR3

Typically, each of these instructions would involve three distinct steps:

    instruction fetch; instruction decode; instruction execute.

In our case, the instruction fetch involves fetching the next instruction from memory, and the instruction execute involves either reading or writing data into memory. In a standard processor, without Harvard architecture, the program instructions (that is, the program code) and the data (operands) are held in one memory space; see Figure 3. Thus the fetching of the next instruction while the current one is executing is not allowed, because the fetch and execution phases each require memory access.


Figure 3. An illustration of instruction fetch, decode and execute in a non-Harvard architecture with a single memory space: (a) instruction fetch from memory (the MPU's program counter and instruction register addressing a single memory holding the instructions LDA ADR1, STA ADR2, STA ADR3 and the operands at ADR1, ADR2, ADR3); (b) timing diagram showing the fetch, decode and execute phases of the three instructions proceeding strictly one after another.


In a Harvard architecture (Figure 4), since the program instructions and data lie in separate memory spaces, the fetching of the next instruction can overlap the execution of the current instruction; see Figure 5. Normally, the program memory holds the program code, while the data memory stores variables such as the input data samples.

Figure 4. Basic Harvard architecture with separate data and program memory spaces: the digital signal processor connects to the program memory over its own program memory address bus and program data bus, and to the data memory over a separate data memory address bus and data bus.

It may be seen from Figure 4 that data and program instruction fetches can be overlapped, as two independent memories are used in the architecture. This is explained with the help of the timing diagram shown in Figure 5 below.


Figure 5. An illustration of the instruction overlap made possible by the Harvard architecture: the fetch, decode and execute phases of LDA ADR1, STA ADR2 and STA ADR3 overlap, with a new instruction starting on each clock cycle.

Strict Harvard architecture is used by some digital signal processors (for example the Motorola DSP56000), but most use a modified Harvard architecture (for example, the TMS320 family of processors). In the modified architecture used by the TMS320, for example, separate program and data memory spaces are still maintained, but communication between the two memory spaces is permissible, unlike in the strict Harvard architecture.

Pipelining:

Pipelining is a technique which allows two or more operations to overlap during execution. In pipelining, a task is broken down into a number of distinct subtasks which are then overlapped during execution. It is used extensively in digital signal processors to increase speed. A pipeline is akin to a typical production line in a factory, such as a car or television assembly plant. As in the production line, the task is broken down into small, independent subtasks called pipe stages. The pipe stages are connected in series to form a pipe, and the stages are executed sequentially. As we have seen in the last example, an instruction can be broken down into three steps. Each step in the instruction can be regarded as a stage in a pipeline and so can be overlapped. By overlapping the instructions, a new instruction is started at the start of each clock cycle, as shown in Figure 6(a).

Figure 6(a). Three instructions overlapped in a three-stage pipeline: while instruction 1 is in pipe stage 3, instruction 2 is in pipe stage 2 and instruction 3 is in pipe stage 1, with a new instruction entering the pipe on each clock cycle.

Figure 6(b) gives the timing diagram for a three-stage pipeline, drawn to highlight the instruction steps. Typically, each step in the pipeline takes one machine cycle.

Figure 6(b). Timing diagram for a three-stage pipeline: during clock cycle i the processor fetches instruction i, decodes instruction i-1 and executes instruction i-2.

Thus during a given cycle up to three different instructions may be active at the same time, although each will be at a different stage of completion. The key to an instruction pipeline is that the three parts of the instruction (that is, fetch, decode and execute) are independent, and so the execution of multiple instructions can be overlapped. In Figure 6(b), it is seen that, at the ith cycle, the processor could be simultaneously fetching the ith instruction, decoding the (i-1)th instruction and at the same time executing the (i-2)th instruction.


The three-stage pipelining discussed above is based on the technique used in the Texas Instruments TMS320 processors. As in other applications of pipelining, in the TMS320 a number of registers are used to achieve the pipeline: a pre-fetch counter holds the address of the next instruction to be fetched, an instruction register holds the instruction to be executed, and a queue instruction register stores the instructions to be executed if the current instruction is still executing. The program counter contains the address of the next instruction to execute.

By exploiting the inherent parallelism in the instruction stream, pipelining leads to a significant reduction, on average, in the execution time per instruction. The throughput of a pipelined machine is determined by the number of instructions through the pipe per unit time. As in a production line, all the stages in the pipeline must be synchronized. The time for moving an instruction from one step to another within the pipe (see Figure 6(a)) is one cycle and depends on the slowest stage in the pipeline. In a perfect pipeline, the average time per instruction is given by

average time per instruction (pipelined) = time per instruction (non-pipelined) / number of pipe stages    (1)

In the ideal case, the speed increase is equal to the number of pipe stages. In practice, the speed increase will be less because of the overheads in setting up the pipeline, delays in the pipeline registers, and so on.

Example 1

In a non-pipelined machine, the instruction fetch, decode, and execute take 35 ns, 25 ns, and 40 ns, respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns pipeline overhead at each stage, and ignore other delays.

In the non-pipelined machine, the average instruction time is simply the sum of the execution times of all the steps: 35 + 25 + 40 ns = 100 ns. However, if we assume that the processor has a fixed machine cycle with the instruction steps synchronized to the system clock, then each instruction would take three machine cycles to complete:

40 ns x 3 = 120 ns (since the slowest cycle is 40 ns). This corresponds to a throughput of 8.3 x 10^6 instructions per second.

In the pipelined machine, the clock speed is determined by the speed of the slowest stage plus overheads. In our case, the machine cycle is 40 + 5 = 45 ns. This places a limit on the average instruction execution time. The throughput (when the pipeline is full) is 22.2 x 10^6 instructions per second. Then

speedup = average instruction time (non-pipelined) / average instruction time (pipelined)
        = 120/45 = 2.67 times (assuming the non-pipelined machine executes each instruction in three cycles)

In the pipelined machine, each instruction still takes three clock cycles, but at each cycle the processor is executing up to three different instructions. Pipelining increases the system throughput, but not the execution time of each instruction on its own. Typically, there is a slight increase in the execution time of each instruction because of the pipeline overhead.
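As a quick check of the arithmetic in Example 1, the following short C program (a minimal sketch, with the stage times and overhead from the example hard-coded) computes the non-pipelined and pipelined instruction times, the corresponding throughputs and the speedup:

    #include <stdio.h>

    int main(void)
    {
        /* Stage delays from Example 1 (ns) and the per-stage pipeline overhead */
        double stages[] = { 35.0, 25.0, 40.0 };   /* fetch, decode, execute */
        double overhead = 5.0;
        int n = 3;

        /* The slowest stage sets the machine cycle */
        double slowest = 0.0;
        for (int i = 0; i < n; i++)
            if (stages[i] > slowest) slowest = stages[i];

        double t_nonpipe = slowest * n;         /* 3 cycles of 40 ns = 120 ns */
        double t_pipe    = slowest + overhead;  /* 45 ns per instruction when the pipe is full */

        printf("non-pipelined: %.0f ns/instruction, %.1f MIPS\n", t_nonpipe, 1000.0 / t_nonpipe);
        printf("pipelined:     %.0f ns/instruction, %.1f MIPS\n", t_pipe, 1000.0 / t_pipe);
        printf("speedup:       %.2f\n", t_nonpipe / t_pipe);
        return 0;
    }

Running it reproduces the figures above: 120 ns (8.3 MIPS) without pipelining, 45 ns (22.2 MIPS) with it, and a speedup of 2.67.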

Pipelining has a major impact on the system memory. The number of memory accesses in a pipelined machine increases, essentially by the number of stages. In DSP, the use of the Harvard architecture, where data and instructions lie in separate memory spaces, promotes pipelining.

When a slow unit, such as a data memory, and an arithmetic element are connected in series, the arithmetic unit often waits idly for a good deal of the time for data. Pipelining may be used in such cases to allow better utilization of the arithmetic unit. The next example illustrates this concept.

Example 2

Most DSP algorithms are characterized by multiply-and-accumulate operations typified by the following equation:


a0 x(n) + a1 x(n-1) + a2 x(n-2) + ... + aN-1 x(n-(N-1))
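This is the standard FIR multiply-accumulate (MAC) sum. As a minimal, processor-independent sketch in C (the coefficient array a[] and the delay line x[], with x[0] holding x(n) and x[k] holding x(n-k), are assumptions of the sketch), the whole computation is one repeated MAC:

    /* FIR multiply-accumulate: sum of a[k] * x(n-k) for k = 0 .. N-1.
       a[]: filter coefficients; x[]: delay line, x[k] = x(n-k). */
    double fir_mac(const double a[], const double x[], int N)
    {
        double acc = 0.0;                /* accumulator */
        for (int k = 0; k < N; k++)
            acc += a[k] * x[k];          /* one multiply-accumulate per tap */
        return acc;
    }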

Figure 7 shows a non-pipelined configuration for an arithmetic element for executing the MAC equation above. Assume a transport delay of 200 ns, 100 ns, and 100 ns, respectively, for the memory, multiplier and accumulator.

Figure 7. Non-pipelined MAC configuration: the coefficient memory (a0 ... aN-1) and data memory (x(n) ... x(n-(N-1))) feed the multiplier, whose products are clocked into the accumulator every 400 ns.

    1. What is the system throughput?

2. Reconfigure the system with pipelining to give a speed increase of 2:1. Illustrate the operation of the new configuration with a timing diagram.

    Solution:

1. The coefficient and data arrays are stored in memory as shown in Figure 7. In the non-pipelined mode, the coefficients and data are accessed sequentially and applied to the multiplier. The products are summed in the accumulator. Successive multiply-accumulate (MAC) operations are performed once every 400 ns (200 + 100 + 100), giving a throughput of 2.5 x 10^6 operations per second.

2. The arithmetic operations involved can be broken up into three distinct steps: memory read, multiply, and accumulate. To improve speed, these steps can be overlapped. A speed improvement of 2:1 can be achieved by inserting pipeline registers between the memory and the multiplier and between the multiplier and the accumulator, as shown in Figure 8.


Figure 8. Pipelined MAC configuration: pipeline registers between the coefficient/data memories and the multiplier serve as temporary stores for each coefficient and data sample pair, and the product register serves as a temporary store for the product before accumulation.


The timing diagram for the pipelined configuration is shown in Figure 9. As is evident in the timing diagram, the MAC is performed once every 200 ns. The limiting factor is the basic transport delay through the slowest element, in this case the memory. Pipeline overheads have been ignored.

Figure 9. Timing diagram for a pipelined MAC unit: the read, multiply and accumulate stages of successive MACs overlap, so that when the pipeline is full a MAC operation is completed every clock cycle (200 ns).

DSP algorithms are often repetitive but highly structured, making them well suited to multilevel pipelining. For example, the FFT requires the continuous calculation of butterflies. Although each butterfly requires different data and coefficients, the basic butterfly arithmetic operations are identical. Thus arithmetic units such as FFT processors can be tailored to take advantage of this. Pipelining ensures a steady flow of instructions to the CPU, and in general leads to a significant increase in system throughput.


However, on occasion pipelining may cause problems. For example, in some digital signal processors, pipelining may cause an unwanted instruction to be executed, especially near branch instructions, and the designer should be aware of this possibility.

Hardware multiplier-accumulator:

The basic numerical operations in DSP are multiplications and additions. Multiplication, in software, is notoriously time consuming. Additions are even more time consuming if floating-point arithmetic is used. To make real-time DSP possible, a fast, dedicated hardware multiplier-accumulator (MAC) using fixed- or floating-point arithmetic is mandatory. A fixed- or floating-point hardware MAC is now standard in all digital signal processors. In a fixed-point processor, the hardware multiplier typically accepts two 16-bit 2's complement fractional numbers and computes a 32-bit product in a single cycle (typically 25 ns). The average MAC instruction time can be significantly reduced through the use of special repeat instructions.
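To illustrate the fixed-point case in software terms (a minimal sketch, not the instruction set of any particular processor): a 16-bit by 16-bit multiply of 2's complement fractional values (often called Q15) produces a 32-bit product, which is then summed into a wider accumulator so that intermediate results do not overflow:

    #include <stdint.h>

    /* Multiply-accumulate of 16-bit Q15 (2's complement fractional) samples.
       Each 16 x 16 multiply yields a 32-bit product; a 64-bit variable stands
       in here for the extended-precision accumulator of a hardware MAC. */
    int64_t mac_q15(const int16_t a[], const int16_t x[], int N)
    {
        int64_t acc = 0;
        for (int k = 0; k < N; k++) {
            int32_t product = (int32_t)a[k] * (int32_t)x[k];  /* 32-bit product */
            acc += product;                                   /* accumulate     */
        }
        return acc;   /* Q30-scaled sum of N products */
    }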


A typical DSP hardware MAC configuration is depicted in Figure 10. In this configuration the multiplier has a pair of input registers that hold the inputs to the multiplier, and a 32-bit product register which holds the result of a multiplication. The output of the P (product) register is connected to a double-precision accumulator, where the products are accumulated.

Figure 10. A typical MAC configuration in DSPs: 16-bit X data and Y data are latched into the X and Y input registers, multiplied into a 32-bit P (product) register, and accumulated into the 32-bit R register.


The principle is very much the same for hardware floating-point multiplier-accumulators, except that the inputs and products are normalized floating-point numbers. Floating-point MACs allow fast computation of DSP results with minimal errors. DSP algorithms such as FIR and IIR filtering suffer from the effects of finite word length (coefficient quantization and arithmetic errors). Floating point offers a wide dynamic range and reduced arithmetic errors, although for many applications the dynamic range provided by the fixed-point representation is adequate.

    General-purpose digital signal processors:

General-purpose digital signal processors are basically high-speed microprocessors with hardware architectures and instruction sets optimized for DSP operations. These processors make extensive use of parallelism, Harvard architecture, pipelining and dedicated hardware whenever possible to perform time-consuming operations such as shifting/scaling, multiplication, and so on.

General-purpose DSPs have evolved substantially over the last decade as a result of the never-ending quest to find better ways to perform DSP operations, in terms of computational efficiency, ease of implementation, cost, power consumption, size, and application-specific needs. The insatiable appetite for improved computational efficiency has led to substantial reductions in instruction cycle times and, more importantly, to increasing sophistication in the hardware and software architectures. It is now common to have dedicated, on-chip arithmetic hardware units (e.g. to support fast multiply-accumulate operations), large on-chip memory with multiple access, and special instructions for efficient execution of inner core computations in DSP. There is also a trend towards increased data word sizes (e.g. to maintain signal quality) and increased parallelism (to increase both the number of instructions executed in one cycle and the number of operations performed per instruction). Thus, in newer general-purpose DSP processors increasing use is made of multiple data paths and arithmetic units to support parallel operations. DSP processors based on SIMD (Single Instruction, Multiple Data), VLIW (Very Long Instruction Word) and superscalar architectures are being introduced to support efficient parallel processing. In some DSPs, performance is enhanced further by using specialized, on-chip co-processors to speed up specific DSP algorithms such as FIR filtering and Viterbi decoding. The explosive growth in communications and digital audio technologies has had a major influence on the evolution of DSPs, as has growth in embedded DSP processor applications.

Fixed-point digital signal processors:

Fixed-point DSP processors available today differ in their detailed architecture and the on-board resources provided. A summary of key architectures of four generations of fixed-point DSP processors from four leading semiconductor manufacturers is given in Table 1. The classification of DSP processors into the four generations is partly based on historical reasons, architectural features, and computational performance.

The basic architecture of the first generation fixed-point DSP processor family (TMS320C1x), first introduced in 1982 by Texas Instruments, is depicted in Figure 11.


Figure 11. A simplified architecture of a first generation fixed-point DSP processor (Texas Instruments TMS320C10): separate program and data memories, a 16 x 16-bit multiplier with input registers, a 32-bit ALU and a 32-bit accumulator, connected by 16-bit program memory and data buses.

Key features of the TMS320C1x are the dedicated arithmetic units, which include a multiplier and an accumulator. The processor family has a modified Harvard architecture with two separate memory spaces for programs and data. It has on-chip memory and special instructions for execution of basic DSP algorithms, although these are limited.

Second generation fixed-point DSPs have substantially enhanced features as compared with the first generation. In most cases, these include much larger on-chip memories and more special instructions to support efficient execution of DSP algorithms. As a result, the computational performance of second generation DSP processors is four to six times that of the first generation.

Typical second generation DSP processors include the Texas Instruments TMS320C5x, Motorola DSP5600x, Analog Devices ADSP21xx and Lucent Technologies DSP16xx families. Texas Instruments first and second generation DSPs have a lot in common architecturally, but the second generation DSPs have more features and increased speed. The internal architecture that typifies the TMS320C5x family of processors is shown in Figure 12 in a simplified form to emphasize the dual internal memory spaces which are characteristic of the Harvard architecture.


The Motorola DSP5600x processor is a high-precision fixed-point digital signal processor. Its architecture is depicted in Figure 13.

Figure 13. A simplified architecture of a second generation fixed-point DSP (Motorola DSP56002): program memory (ROM/RAM), X data and Y data memories on separate 24-bit X, Y and global data buses, a 24 x 24/56-bit MAC with two 56-bit accumulators, and a data bus switch to the 24-bit external data bus.

Internally, it has two independent data memory spaces, the X data and Y data memory spaces, and one program memory space. Having two separate data memory spaces allows a natural partitioning of data for DSP operations and facilitates the execution of the algorithm. For example, in graphics applications data can be stored as X and Y data, in FIR filtering as coefficients and data, and in the FFT as real and imaginary parts. During program execution, pairs of data samples can be fetched or stored in internal memory simultaneously in one cycle. Externally, the two data spaces are multiplexed into a single data bus, reducing somewhat the benefits of the dual internal data memory. The arithmetic units consist of two 56-bit accumulators and a single-cycle, fixed-point hardware multiplier-accumulator (MAC). The MAC accepts 24-bit inputs and produces a 56-bit product. The 24-bit word length provides sufficient accuracy for representing most DSP variables, while the 56-bit accumulator (including eight guard bits) prevents arithmetic overflows. These word lengths are adequate for most applications, including digital audio, which imposes stringent requirements. The DSP5600x processors provide special instructions that allow zero-overhead looping and a bit-reversed addressing capability for scrambling input data before the FFT, or unscrambling the fast Fourier transformed data.
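Bit-reversed addressing is what a radix-2 FFT needs in order to reorder its input (or output) samples. A plain-C sketch of the reordering that such an addressing mode performs for free in hardware (the array length is assumed to be a power of two) is:

    /* Reverse the lowest 'bits' bits of the index i (e.g. for 3 bits, 001 -> 100). */
    static unsigned bit_reverse(unsigned i, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned b = 0; b < bits; b++) {
            r = (r << 1) | (i & 1u);
            i >>= 1;
        }
        return r;
    }

    /* Scramble an array of N = 2^bits samples into bit-reversed order,
       as a DSP's bit-reversed addressing mode would do with zero overhead. */
    void bit_reverse_reorder(float x[], unsigned bits)
    {
        unsigned N = 1u << bits;
        for (unsigned i = 0; i < N; i++) {
            unsigned j = bit_reverse(i, bits);
            if (j > i) {                 /* swap each pair only once */
                float t = x[i];
                x[i] = x[j];
                x[j] = t;
            }
        }
    }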

The Analog Devices ADSP21xx is another family of second generation fixed-point DSP processors, with two separate external memory spaces: one holds data only, and the other holds program code as well as data. A simplified block diagram of the internal architecture of the ADSP21xx is depicted in Figure 14.


Figure 14. A simplified architecture of a second generation fixed-point DSP (Analog Devices ADSP2100): program and data memory units on a 24-bit program memory path and a 16-bit data memory path, feeding arithmetic units comprising an ALU, MAC and shifter.

The main components are the ALU, multiplier-accumulator, and shifters. The MAC accepts 16 x 16-bit inputs and produces a 32-bit product in one cycle. The accumulator of the ADSP21xx has eight guard bits which may be used for extended precision. The ADSP21xx departs from the strict Harvard architecture, as it allows the storage of both data and program instructions in the program memory. A signal line (the data access signal) is used to indicate when data, and not program instructions, are being fetched from the program memory. Storage of data in the program memory inhibits a steady data flow through the CPU, as data and instruction fetches cannot occur simultaneously. To avoid a bottleneck, the ADSP21xx family has an on-chip program memory cache which holds the last 16 instructions executed. This eliminates the need, especially when executing program loops, for repeated instruction fetches from program memory. The ADSP21xx provides special instructions for zero-overhead looping and supports a bit-reversed addressing facility for the FFT. The processor family has a large on-chip memory (up to 64 Kbytes of internal RAM is provided for increased data transfer). The processor has excellent support for DMA: external devices can transfer data and instructions to or from the DSP processor RAM without processor intervention.

The Lucent Technologies DSP16xx family of fixed-point DSPs (see Figure 15) is targeted at the telecommunications and modem market.

Figure 15. A simplified architecture of the Lucent Technologies DSP16xx fixed-point DSP: program memory, data memory and an instruction cache on 16-bit X and Y data buses, with a 16 x 16-bit multiplier, an ALU and two 36-bit accumulators.


In terms of computational performance, it is one of the most powerful second generation processors. The processor has a Harvard architecture and, like most of the other second generation processors, it has two data paths, the X and Y data paths. Its data arithmetic units include a dedicated 16 x 16-bit multiplier, a 36-bit ALU/shifter (which includes four guard bits) and dual accumulators. Special instructions, such as those for zero-overhead single and block instruction looping, are provided.

Third generation fixed-point DSPs are essentially enhancements of second generation DSPs. In general, performance enhancements are achieved by increasing and/or making more effective use of available on-chip resources. Compared with the second generation DSPs, features of the third generation DSPs include more data paths (typically three compared with two in the second generation), wider data paths, larger on-chip memory and instruction cache, and in some cases a dual MAC. As a result, the performance of third generation DSPs is typically two or three times superior to that of the second generation DSP processors of the same family. Simplified architectures of three third generation DSP processors, the TMS320C54x, DSP563x and DSP16000, are depicted in Figures 16, 17 and 18.


Figure 16. A simplified architecture of a third generation fixed-point DSP (Texas Instruments TMS320C54x): 16 K words of program ROM plus 8 K and 24 K words of program/data RAM on multiple data buses (program, C and D data buses), with a MAC unit (17 x 17-bit multiplier, 40-bit adder, round/scale logic) and an ALU unit (40-bit ALU, Viterbi accelerator, two 40-bit accumulators, 40-bit shifter).


Figure 17. A simplified architecture of a third generation fixed-point DSP (Motorola DSP56300): a 4 K-word program cache and 2 K-word X data and Y data RAMs on separate program, X and Y data buses, with a data ALU containing a 24 x 24-bit MAC, two 56-bit accumulators and a shifter.


Figure 18. A simplified architecture of a third generation fixed-point DSP (Lucent Technologies DSP16000): program and data memories on 32-bit X and Y data buses, and an arithmetic unit with two 16 x 16 MACs, an ALU/adder and eight 40-bit accumulators.


Most of the third generation fixed-point DSP processors are aimed at applications in digital communication and digital audio, reflecting the enormous growth and influence of these application areas on DSP processor development. Thus there are features in some of the processors that support these applications. In the third generation processors, semiconductor manufacturers have also taken the issue of power consumption seriously because of the processors' use in portable and handheld devices.

Fourth generation fixed-point processors, with their new architectures, are primarily aimed at large and/or emerging multichannel applications, such as the digital subscriber loop, remote access server modems, wireless base stations, third generation mobile systems and medical imaging. The new fixed-point architecture that has attracted a great deal of attention in the DSP community is the very long instruction word (VLIW) architecture. The new architecture makes extensive use of parallelism whilst retaining some of the good features of previous DSP processors. Compared with previous generations, fourth generation fixed-point DSP processors, in general, have wider instruction words, wider data paths, more registers, larger instruction caches and multiple arithmetic units, enabling them to execute many more instructions and operations per cycle.

The Texas Instruments TMS320C62x family of fixed-point DSP processors is based on the VLIW architecture, as shown in Figure 19.


Figure 19. A simplified architecture of a fourth generation fixed-point, very long instruction word DSP processor (Texas Instruments TMS320C62x): on-chip program and data RAM, a 256-bit program data bus, two 32-bit data buses (A and B), an instruction fetch/dispatch/decode unit and two independent arithmetic data paths, each with its own register file and four execution units (L1, S1, M1 and D1; L2, S2, M2 and D2).

The core processor has two independent arithmetic paths, each with four execution units: a logic unit (Li), a shifter/logic unit (Si), a multiplier (Mi) and a data address unit (Di). Typically, the core processor fetches eight 32-bit instructions at a time, giving an instruction width of 256 bits (and hence the term very long instruction word). With a total of eight execution units, four in each data path, the TMS320C62x can execute up to eight instructions in parallel in one cycle. The processor has large program and data cache memories (typically, 4 Kbytes of level 1 program/data caches and 64 Kbytes of level 2 program/data cache). Each data path has its own register file (sixteen 32-bit registers), but can also access registers on the other data path. Advantages of VLIW architectures include simplicity and high computational performance. Disadvantages include increased program memory usage (organization of code to match the inherent parallelism of the processor may lead to inefficient use of memory). Further, optimum processor performance can only be achieved when all the execution units are busy, which is not always possible because of data dependencies, instruction delays and restrictions in the use of the execution units. However, sophisticated programming tools are available for code packing, instruction scheduling, resource assignment, and in general to exploit the vast potential of the processor.
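To give a flavour of what organizing code to match the processor's parallelism means, the following C sketch (purely illustrative, not TMS320C62x code or intrinsics) unrolls an FIR loop into two independent partial sums; a VLIW compiler or hand scheduler can then map the independent multiplies and adds onto the two data paths:

    /* FIR sum split into two independent partial sums so that a VLIW
       scheduler can issue the work to two parallel data paths.
       The number of taps N is assumed to be even. */
    float fir_unrolled(const float a[], const float x[], int N)
    {
        float acc0 = 0.0f;   /* partial sum intended for data path 1 */
        float acc1 = 0.0f;   /* partial sum intended for data path 2 */

        for (int k = 0; k < N; k += 2) {
            acc0 += a[k]     * x[k];       /* even taps: independent of acc1 */
            acc1 += a[k + 1] * x[k + 1];   /* odd taps:  independent of acc0 */
        }
        return acc0 + acc1;   /* combine the two partial sums */
    }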

    Floating-point digital signal processors:

The ability of DSP processors to perform high-speed, high-precision DSP operations using floating-point arithmetic has been a welcome development. This minimizes the finite word-length effects such as overflows, round-off errors, and coefficient quantization errors inherent in DSP. It also facilitates algorithm development, as a designer can develop an algorithm on a large computer in a high-level language and then port it to a DSP device more readily than with a fixed-point device.

Floating-point DSP processors retain key features of fixed-point processors, such as special instructions for DSP operations and multiple data paths for multiple operations. As in the case of fixed-point DSP processors, the floating-point DSP processors available are significantly different architecturally.

The TMS320C3x is perhaps the best known family of first generation general-purpose floating-point DSPs. The C3x family are 32-bit single-chip digital signal processors and support both integer and floating-point arithmetic operations. They have a large memory space and are equipped with many on-chip peripheral facilities to simplify system design. These include a program cache to improve the execution of commonly used code, and on-chip dual-access memories. The large memory spaces cater for memory-intensive applications, for example graphics and image processing. In the TMS320C30, a floating-point multiplication requires 32-bit operands and produces a 40-bit normalized floating-point product. Integer multiplication requires 24-bit inputs and yields 32-bit results. Three floating-point formats are supported. The first is a 16-bit short floating-point format, with a 4-bit exponent, 1 sign bit and 11 bits for the mantissa. This format is for immediate floating-point operations. The second is a single-precision format with an 8-bit exponent, 1 sign bit and a 23-bit fraction (32 bits in total). The third is a 40-bit extended-precision format which has an 8-bit exponent, 1 sign bit and a 31-bit fraction. The floating-point representation differs from that of the IEEE standard, but facilities are provided to allow conversion between the two formats. The TMS320C3x combines the features of the Harvard architecture (separate buses for program instructions, data and I/O) and the Von Neumann processor (unified address space).
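For reference, the IEEE single-precision format that the C3x native representation must be converted to and from packs 1 sign bit, an 8-bit biased exponent and a 23-bit fraction into 32 bits. A small, generic C sketch (not the TMS320C3x conversion routine) that unpacks these fields is:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = -1.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32-bit pattern */

        uint32_t sign     = bits >> 31;           /* 1 bit                 */
        uint32_t exponent = (bits >> 23) & 0xFFu; /* 8 bits, biased by 127 */
        uint32_t fraction = bits & 0x7FFFFFu;     /* 23-bit fraction       */

        printf("sign=%u  exponent=%u (unbiased %d)  fraction=0x%06X\n",
               sign, exponent, (int)exponent - 127, fraction);
        return 0;
    }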

The emphasis in the second generation, general-purpose floating-point DSPs is on multiprocessing and multiprocessor support. Key issues in multiprocessor support include inter-processor communication, DMA transfers and global memory sharing. The best known second generation floating-point DSP families are the Texas Instruments TMS320C4x and Analog Devices ADSP-2106x SHARC (Super Harvard Architecture Computer). The C4x shares some of the architectural features of the C3x, but it was designed for multiprocessing. The C4x family has good I/O capabilities: it has six COMM ports for inter-processor communication and six 32-bit wide DMA channels for rapid data transfers. The architecture allows multiple operations to be performed in parallel in one instruction cycle. The C4x family supports both floating- and fixed-point arithmetic. The native floating-point data format in the C40 differs from the IEEE 754/854 standard, although conversion between them can be readily accomplished.

Analog Devices ADSP-2106x SHARC DSP processors are also 32-bit floating-point devices. They have large internal memory and impressive I/O capability: 10 DMA channels to allow access to internal memory without processor intervention, and six link ports for inter-processor communications at high speed. The architecture allows shared global memory, making it possible for up to six SHARC processors to access each other's internal RAM at up to the full data rate. The ADSP-2106x family supports both fixed-point and floating-point arithmetic. Its single-precision floating-point format complies with the single-precision IEEE 754/854 floating-point standard (24-bit mantissa and 8-bit exponent). The architecture also supports multiple operations per cycle.

Third generation floating-point DSP processors take the concepts of parallelism much farther to increase both the number of instructions and the number of operations in a cycle, to meet the challenges of multichannel and computationally intensive applications. This is achieved by the use of new architectures, the VLIW (very long instruction word) and superscalar architectures in particular. The two leading third generation floating-point DSP processor families are the Texas Instruments TMS320C67x and Analog Devices ADSP-TS001. The TMS320C67x family has the same VLIW architecture as the advanced, fourth generation fixed-point DSP processors, the TMS320C62x.

The Tiger SHARC DSP family supports mixed arithmetic types (fixed- and floating-point arithmetic) and data types (8-, 16-, and 32-bit numbers). This flexibility makes it possible to use the arithmetic and data type most appropriate for a given application to enhance performance. As with the TMS320C67x, the Tiger SHARC is aimed at large-scale, multi-channel applications, such as third generation mobile systems (3G wireless), digital subscriber lines (xDSL) and remote, multiple access server modems for Internet services. Tiger SHARC, with its static superscalar architecture, combines the good features of the VLIW architecture, conventional DSP architecture, and RISC computers. The processor has two computation blocks, each with a multiplier, ALU and 64-bit shifter. The processor can execute up to eight MAC operations per cycle with 16-bit inputs and 40-bit accumulation, two 40-bit MACs on 16-bit complex data, or two 80-bit MACs with 32-bit data. With 8-bit data, Tiger SHARC can issue up to 16 operations in a cycle. Tiger SHARC has a wide memory bandwidth, with its memory organized in three 128-bit wide banks. Access to data can be in variable data sizes: normal 32-bit words, long 64-bit words or quad 128-bit words. Up to four 32-bit instructions can be issued in one cycle. To avoid the use of large NOPs (which is a disadvantage of VLIW designs), the large instruction words may be broken down into separate short instructions which are issued to each unit independently.
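Treating one wide word as several narrower values is the SIMD idea behind these multiple MACs per cycle. A plain-C illustration (generic, not TigerSHARC intrinsics) that packs two 16-bit samples into one 32-bit word and adds two such packed words lane by lane is:

    #include <stdint.h>

    /* Pack two unsigned 16-bit samples into one 32-bit word. */
    uint32_t pack16x2(uint16_t lo, uint16_t hi)
    {
        return ((uint32_t)hi << 16) | lo;
    }

    /* Add two pairs of packed 16-bit samples in one pass, with wrap-around
       kept inside each 16-bit lane (SIMD within a register). */
    uint32_t add16x2(uint32_t a, uint32_t b)
    {
        uint32_t lo = (a + b) & 0x0000FFFFu;          /* low lane          */
        uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high lane, wraps  */
        return hi | lo;
    }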


    Selecting digital signal processors:

The choice of a DSP processor for a given application has become an important issue in recent years because of the wide range of processors available (Levy, 1999; Berkeley Design Technology, 1996, 1999). Specific factors that may be considered when selecting a DSP processor for an application include architectural features, execution speed, type of arithmetic and word length.

    1. Architectural features

Most DSP processors available today have good architectural features, but these may not be adequate for a specific application. Key features of interest include the size of on-chip memory, special instructions and I/O capability. On-chip memory is an essential requirement in most real-time DSP applications for fast access to data and rapid program execution. For memory-hungry applications (e.g. digital audio, FAX/modem, MPEG coding/decoding), the size of internal RAM may become an important distinguishing factor. Where internal memory is insufficient, it can be augmented by high-speed, off-chip memory, although this may add to system costs. For applications that require fast and efficient communication or data flow with the outside world, I/O features such as interfaces to ADCs and DACs, DMA capability and support for multiprocessing may be important. Depending on the application, a rich set of special instructions to support DSP operations is important, e.g. zero-overhead looping capability, dedicated DSP instructions, and circular addressing.
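Circular addressing is what keeps an FIR delay line up to date without physically shifting the samples. A minimal C sketch of the idea (the hardware performs the modulo wrap automatically; here it is written out, with a buffer length of 64 assumed for the example):

    /* Circular delay line: the newest sample overwrites the oldest, and the
       'head' index wraps around instead of the data being shifted. */
    #define DELAY_LEN 64

    typedef struct {
        float buf[DELAY_LEN];
        int   head;            /* index of the most recent sample */
    } delay_line;

    void put_sample(delay_line *d, float sample)
    {
        d->head = (d->head + 1) % DELAY_LEN;   /* modulo wrap = circular addressing */
        d->buf[d->head] = sample;
    }

    /* Fetch x(n-k): the sample k places behind the newest one. */
    float get_delayed(const delay_line *d, int k)
    {
        return d->buf[(d->head - k + DELAY_LEN) % DELAY_LEN];
    }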


    2. Execution speed

The speed of DSP processors is an important measure of performance because of the time-critical nature of most DSP tasks. Traditionally, the two main units of measurement for this are the clock speed of the processor, in MHz, and the number of instructions performed, in millions of instructions per second (MIPS) or, in the case of floating-point DSP processors, in millions of floating-point operations per second (MFLOPS). However, such measures may be inappropriate in some cases because of significant differences in the way different DSP processors operate, with most able to perform multiple operations in one machine instruction. For example, the C62x family of processors can execute as many as eight instructions in a cycle. The number of operations performed in each cycle also differs from processor to processor. Thus, comparison of the execution speed of processors based on such measures may not be meaningful. An alternative measure is based on the execution speed of benchmark algorithms, e.g. DSP kernels such as the FFT, FIR and IIR filters (Levy, 1998; Berkeley Design Technology, 1999).
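As a rough illustration of benchmarking a DSP kernel (a host-side sketch using standard C timing, not a vendor benchmark suite), one can time many repetitions of an FIR kernel and report the average time per output sample:

    #include <stdio.h>
    #include <time.h>

    #define TAPS 64
    #define RUNS 100000

    /* The kernel being benchmarked: one FIR output sample. */
    static float fir(const float a[], const float x[], int n)
    {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += a[k] * x[k];
        return acc;
    }

    int main(void)
    {
        static float a[TAPS], x[TAPS];
        volatile float sink = 0.0f;             /* keep the result live */

        for (int k = 0; k < TAPS; k++) { a[k] = 0.01f * k; x[k] = 1.0f; }

        clock_t t0 = clock();
        for (int r = 0; r < RUNS; r++)
            sink += fir(a, x, TAPS);
        clock_t t1 = clock();

        double total_s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        printf("average time per %d-tap FIR output: %.1f ns\n",
               TAPS, 1e9 * total_s / RUNS);
        return 0;
    }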

    3. Type of arithmetic

The two most common types of arithmetic used in modern DSP processors are fixed- and floating-point arithmetic. Floating-point arithmetic is the natural choice for applications with wide and variable dynamic range requirements (dynamic range may be defined as the difference between the largest and smallest signal levels that can be represented, or the difference between the largest signal and the noise floor, measured in decibels). Fixed-point processors are favored in low-cost, high-volume applications (e.g. cellular phones and computer disk drives). The use of fixed-point arithmetic raises issues associated with dynamic range constraints which the designer must address. In general, floating-point processors are more expensive than fixed-point processors, although the cost difference has fallen significantly in recent years. Most floating-point DSP processors available today also support fixed-point arithmetic.
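Using the definition of dynamic range above, an N-bit fixed-point word gives roughly 6 dB per bit, while a floating-point format gets most of its range from the exponent. A small C sketch of the comparison (illustrative figures only, assuming a 16-bit fixed-point word and an IEEE-style 8-bit exponent):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int fixed_bits = 16;   /* e.g. a 16-bit fixed-point data word       */
        int exp_bits   = 8;    /* e.g. the 8-bit exponent of an IEEE float  */

        /* Fixed point: largest/smallest magnitude ratio is about 2^(N-1),
           i.e. roughly 6.02 dB per bit. */
        double fixed_dr_db = 20.0 * log10(pow(2.0, fixed_bits - 1));

        /* Floating point: normalized IEEE-style values span roughly
           2^-(2^(e-1) - 2) to 2^(2^(e-1)), a ratio of about 2^(2^e - 2). */
        double float_dr_db = 20.0 * log10(pow(2.0, (1 << exp_bits) - 2));

        printf("16-bit fixed point: ~%.0f dB dynamic range\n", fixed_dr_db);
        printf("32-bit float      : ~%.0f dB dynamic range\n", float_dr_db);
        return 0;
    }

This prints roughly 90 dB for the 16-bit fixed-point word and about 1500 dB for the 32-bit floating-point format, which is why floating point suits wide and variable dynamic range requirements.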

    4. Word length

Processor data word length is an important parameter in DSP as it can have a significant impact on signal quality: it determines how accurately parameters and results of DSP operations can be represented. In general, the longer the data word, the lower the errors that are introduced by digital signal processing. In fixed-point audio processing, for example, a processor word length of at least 24 bits is required to keep the smallest signal level sufficiently above the noise floor generated by signal processing to maintain CD quality. A variety of processor word lengths are used in fixed-point DSP processors, depending on the application. Fixed-point DSP processors aimed at telecommunications markets tend to use a 16-bit word length (e.g. TMS320C54x), whereas those aimed at high-quality audio applications tend to use 24 bits (e.g. DSP56300). In recent years there has been a trend towards the use of more bits for the ADC and DAC (e.g. the Cirrus 24-bit audio codec, CS4228) as the cost of these devices falls to meet the insatiable demand for increased quality. Thus, there is likely to be an increased demand for larger processor word lengths for audio processing. In fixed-point processors, it may also be necessary to provide guard bits (typically 1 to 8 bits) in the accumulators to prevent arithmetic overflows during extended multiply-and-accumulate operations. The extra bits effectively extend the dynamic range available in the DSP processor. In most floating-point DSP processors, a 32-bit data size (24-bit mantissa and 8-bit exponent) is used for single-precision arithmetic. This size is also compatible with the IEEE floating-point format (IEEE 754). Most floating-point DSP processors also have fixed-point arithmetic capability, and often support variable data size, fixed-point arithmetic.
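The value of guard bits can be seen with a short calculation: each guard bit doubles the number of full-scale products that can be accumulated before overflow. A small C sketch (assuming, as is typical, that a 24 x 24-bit multiply yields a 48-bit product held in a 56-bit accumulator, leaving 8 guard bits):

    #include <stdio.h>

    int main(void)
    {
        /* A 24 x 24-bit multiply produces up to a 48-bit product; a 56-bit
           accumulator therefore leaves 56 - 48 = 8 guard bits of headroom. */
        int product_bits = 48;
        int acc_bits     = 56;
        int guard_bits   = acc_bits - product_bits;

        /* Up to 2^guard_bits full-scale products can be summed before the
           accumulator can overflow. */
        long max_terms = 1L << guard_bits;

        printf("%d guard bits -> up to %ld full-scale MACs without overflow\n",
               guard_bits, max_terms);
        return 0;
    }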


    TMS320C6416 DSP Board



    Functional block and DSP core diagram for TMS320C6416 DSP