
DSP APPLICATIONS

UNIT -5 DISCRETE TIME SIGNAL PROCESSING

A Digital Signal Processing System

[Block diagram: Analog signal in → Antialiasing filter → Sample and hold → A/D → DSP → D/A → Reconstruction filter → Analog signal out]

9/28/2019 2

A perspective of the Digital Signal Processing problem

[Figure. Recovered labels: Application areas (Radar, Speech, Seismic, Image, Medical, ...); Digital signal processing theory (Theoretical problem modelling, Algorithms, Basic functions); Implementation (Architectures, Processor instruction sets and/or hardware functions, Component technology)]

TOPIC 1 : DSP FUNCTIONALITIES

APPLICATION REQUIREMENTS | PROCESSOR ATTRIBUTES

REAL-TIME PROCESSING | HIGH SPEED, HIGH THROUGHPUT

LARGE ARRAY OF DATA | INSTRUCTIONS TO MOVE AND PROCESS LARGE DATA ARRAYS

ALGORITHM INTENSIVE | FAST MATHEMATICAL COMPUTATIONS, SINGLE CYCLE DSP OPERATIONS (MACD)

SYSTEM FLEXIBILITY | GENERAL PURPOSE PROGRAMMABILITY, EPROM, MC/MP MODE

Functions of DSP Processors

Caters to high arithmetic demands

Real time operation

Analog input / output

Large number of functional units for a given size

Small control logic

Different approaches to hardware implementation

1. HIGH SPEED GENERAL PURPOSE COMPUTERS

Advantages: Programmable; can be configured for different applications

Disadvantages: Expensive; complex control; I/O overheads

2. CUSTOM-DESIGNED VLSI COMPONENTS

Advantages: Efficient design; large throughputs

Disadvantages: Application specific; high development cost

3. GENERAL PURPOSE DIGITAL SIGNAL PROCESSORS

Combine the Programmability & Control features of general purpose computers and the Architectural innovations of special purpose chips.

GOALS: HIGH SPEED, LOW POWER AND LOW COST

2marks : Why are conventional Processors not suitable for DSP?

Caches are a waste of chip area

Small register files force lots of memory accesses

- register files differ from caches in that they are program managed

Complex instruction issue logic, branch prediction, speculation etc. are not needed for DSP

Not enough ALU functions

Inadequate dynamic range and precision

Data Processing vs Signal Processing

• General-purpose microprocessors are designed primarily for Data Processing.

– The primary burden is Data Read/Write

• Digital Signal Processors are Microprocessors specifically designed for Signal Processing.

– The primary burden is Mathematical operation

• DSP architecture therefore incorporates certain features not found in general-purpose µPs.

TOPIC 2 : CIRCULAR BUFFERING

• A critical element of Digital Signal Processors.

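• Definition : Circular buffering is an addressing capability of DSP processors in which the data address wraps around to the start of a buffer when it reaches the end, so each new sample overwrites the oldest one and no stored data has to be moved. This significantly accelerates data transfer in a real-time system.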

• Main goal : to accelerate the calculations while keeping the power consumption as low as possible.

• Example : FIR filter will show the typical properties of various DSP algorithms.
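A minimal Python sketch of circular addressing, assuming a software model rather than any particular DSP's hardware (the class name `CircularBuffer` and its methods are illustrative only). The write pointer advances modulo the buffer length, so a new sample replaces the oldest one and nothing else moves.

```python
class CircularBuffer:
    """Fixed-length buffer whose write address wraps around (modulo addressing)."""

    def __init__(self, length):
        self.data = [0] * length
        self.length = length
        self.newest = length - 1  # index of the most recent sample

    def push(self, sample):
        # Advance the pointer and wrap at the end of the buffer.
        self.newest = (self.newest + 1) % self.length
        self.data[self.newest] = sample  # overwrites the oldest sample

    def delayed(self, k):
        # Return x(n-k): the sample written k pushes ago (0 <= k < length).
        return self.data[(self.newest - k) % self.length]


buf = CircularBuffer(4)
for x in [10, 20, 30, 40, 50]:
    buf.push(x)
history = [buf.delayed(k) for k in range(4)]  # newest sample first
```

After five pushes into a four-word buffer, the fifth sample has taken the place of the first, while the other three stay where they were written.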


Example : Performing FIR filtering on a real time input

• Assume that we have a four-tap filter with the following difference equation: y(n) = b0 x(n) + b1 x(n-1) + b2 x(n-2) + b3 x(n-3).

• For each input sample, this example requires the following sequence of steps, called the Multiply and Accumulate (MAC) cycle.

MAC Execution Cycle

• Obtain a sample of the Input signal

• Move the input sample into the input buffer

• Fetch the co-efficient from internal buffer

• Multiply the input sample by the co-efficient

• Add the product to the Accumulator

• Move the output to the output buffer

• Send it out as a sample of the output signal
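The steps above can be sketched in Python as follows. This is an illustrative model only: the function name `fir_mac`, the sample values and the coefficient values are assumptions, and the coefficient fetch, multiply and add steps are repeated once per tap.

```python
def fir_mac(samples, coeffs):
    """Filter `samples` with FIR taps `coeffs`, one MAC loop per input sample."""
    taps = len(coeffs)
    buffer = [0.0] * taps      # input buffer, addressed circularly
    newest = 0                 # position of the most recent sample
    output = []
    for x in samples:          # obtain a sample of the input signal
        newest = (newest + 1) % taps
        buffer[newest] = x     # move the sample into the input buffer
        acc = 0.0              # clear the accumulator
        for k in range(taps):
            b = coeffs[k]                               # fetch the coefficient
            product = b * buffer[(newest - k) % taps]   # multiply by x(n-k)
            acc += product                              # add to the accumulator
        output.append(acc)     # move the result to the output buffer
    return output


# Hypothetical four-tap example: b0..b3 chosen only for illustration.
y = fir_mac([1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.25, 0.125, 0.125])
```

Feeding a unit impulse through `fir_mac` returns the coefficients themselves, which is a quick sanity check on the indexing.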


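MAC Execution Hardware

[Figure: the program counter addresses the program/coefficient memory; the data read address and data write address select locations in data memory; inside the CPU, a multiplier (*) and an adder (+) feed the accumulator (ACC).]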

Circular address allocation of the following example


TOPIC 3 : DSP ARCHITECTURE

• General-purpose processors are based on the Von Neumann architecture (a single memory bank, which the processor accesses through a single set of address and data lines)

• Harvard architecture commonly used in DSP processors

– Separate Data and Program memories (two memory banks)

– Separate Address Buses for Data and Program memories


Features of a DSP architecture

• Instruction Cache and Pipelined processor as in any modern microprocessor, but no Data Cache

• Separate ALU, Multiplier and Shifter, connected through multiple internal data buses, enabling fast MAC operations

DSP, CISC and RISC

• DSP processors cannot strictly be classified as either CISC or RISC processors

• Some features present in a RISC processor may exist. However, DSP processors are “tuned” towards operations encountered in signal processing applications

DSP ARCHITECTURE IMPLEMENTATION

Important desirable characteristics

Adequate word length

Fast multiply & accumulate

High speed RAM

Fast Coefficient table addressing

Fast new sample fetch mechanism

GENERAL PURPOSE DSP FEATURES

1. PARALLELISM: Multiple Functional Units, Multiple Buses, Multiple Memories

2. PIPELINING

3. HARDWARE MULTIPLIERS AND OTHER ARITHMETIC FUNCTIONS

4. ON-CHIP AND CACHE MEMORIES

5. A VARIETY OF ADDRESSING MODES

7. INSTRUCTIONS THAT PACK SEVERAL OPERATIONS

8. ZERO-OVERHEAD LOOPING

9. I/O FEATURES SUCH AS INTERRUPT, SERIAL I/O, DMA

10. OTHER CONTROL FUNCTIONS SUCH AS WAIT STATES


A Typical DSP Architecture

[Block diagram: Program Memory (PM, instructions and secondary data) and Data Memory (DM, data only), each with its own data address generator, address bus (PM Address, DM Address) and data bus (PM Data, DM Data). The buses connect the program sequencer, instruction cache, registers, multiplier, ALU and shifter. An I/O controller (DMA) links input/output to the memories over the DMA bus.]

SHIFTERS

- Scale numbers to prevent overflow/underflow

- Convert between fixed point and floating point

- Many bits must be shifted in a single cycle to preserve single cycle computational speed (Barrel Shifter)

- Logical shift assumes unsigned data and fills with zeroes, left or right

- Arithmetic shift scales numbers upwards (left, zero fill) or downwards (right, sign extend)

- Normalization/de-normalization for block floating point
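The difference between logical and arithmetic shifts can be sketched in Python by modelling a 16-bit word. The 16-bit width and the helper names below are assumptions made for illustration; a barrel shifter does the same work in hardware in one cycle.

```python
MASK16 = 0xFFFF


def to_signed16(value):
    """Interpret the low 16 bits of `value` as a two's complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def logical_shift_right(word, count):
    # Treats the word as unsigned: zeros enter from the left.
    return (word & MASK16) >> count


def arithmetic_shift_right(word, count):
    # Treats the word as signed: the sign bit is replicated (sign extension).
    return to_signed16(word) >> count


def shift_left(word, count):
    # Zeros enter from the right; bits shifted past bit 15 are lost.
    return to_signed16((word << count) & MASK16)


x = -8                                   # 0xFFF8 as a 16-bit word
logical = logical_shift_right(x, 1)      # 0x7FFC, no longer negative
arithmetic = arithmetic_shift_right(x, 1)  # still negative, halved
overflowed = shift_left(0x4000, 1)       # scaling up past the sign bit
```

The last line shows why scaling upwards needs overflow management: doubling 0x4000 lands on the sign bit and the result reads as a negative number.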

Memory

Traditional µPs : register to register (limited memory bandwidth)

DSPs : memory to memory (higher memory bandwidth)

- up to six memory fetches in an instruction cycle

- parallel memory banks: small, fast and simple memories

Internal Vs External

- Pin count limitation

- Speed penalty

- Off-chip bussing: internal busses are multiplexed to the outside; expand only one memory off-chip

MEMORY ORGANISATION - I

[Figure: three memory organisations. BASIC HARVARD ARCHITECTURE: program memory, data memory. MODIFICATION #1: program/data memory, data memory. MODIFICATION #2: program memory, multi-port data memory.]

MEMORY ORGANISATION - II

[Figure: MODIFICATION #3, MODIFICATION #4 and MODIFICATION #5. Recovered block labels: program cache with program/data memory and data memory (MODIFICATION #3); program memories, multiple data memories and I/O (MODIFICATIONS #4 and #5).]

Example : DSP architecture of a second order FIR filter

[Figure: x(n) passes through two delays, giving x(n-1) and x(n-2); the three samples are weighted by h(0), h(1) and h(2) and summed to give y(n).]

y(n) = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2)

Organization of signal samples and filter coefficients for a second order FIR filter implementation

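[Figure: signal samples x(n+1), x(n), x(n-1), x(n-2) stored with a delay between successive samples; filter coefficients h(0), h(1), h(2); address registers ar1 and ar2 address the samples and coefficients; the MAC unit produces y(n).]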

An Nth order FIR filter implementation

[Figure: coefficient memory holds A[0], A[1], A[2], ..., A[N-1]; data memory holds X[n], X[n-1], X[n-2], ..., X[n-N+1]. The multiplier (*) forms the product P, and the adder (+) accumulates it in ACC to give y[n].]

Salient Features

• REPEAT-MAC instruction

- Performs auto-increment of both coefficient and data pointers

- Frees up program memory bus for fetching coefficients

• Circular buffer

- to manage data movement at the end of every output computation

• Handling precision

- Accumulator guard bits

- Saturation mode

- Shifters (both right and left shift)

TOPIC 4 : FIXED AND FLOATING POINT ARCHITECTURE PRINCIPLES

Fixed point Vs Floating point

Fixed point

- Array indices, loop counters etc.

- Overflow/underflow management

- Error budget for word length growth

Floating point

- Wider dynamic range

- Frees user from scaling concerns

- Less sensitive to error accumulation

- 50% slower for same technology

- Higher cost

- Normalize after each operation

- Mantissa round off (some accuracy is traded)

Fixed point does not always limit performance: e.g., for a dynamic range of 50 to 60 dB, 12-bit quantization (step size of -72 dB) is more than adequate. For hi-fi audio with 80 dB dynamic range, 16 bits (-96 dB) are more than adequate.
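The -72 dB and -96 dB figures follow from the rule of thumb of about 6 dB per bit. A small Python check, assuming the step size is measured relative to full scale (the function name is illustrative):

```python
import math


def dynamic_range_db(bits):
    """Ratio of full scale to one quantization step, in dB, for a `bits`-bit word."""
    return 20 * math.log10(2 ** bits)


range_12 = dynamic_range_db(12)   # roughly 72 dB
range_16 = dynamic_range_db(16)   # roughly 96 dB
```

Each extra bit doubles the number of levels and so adds 20·log10(2), about 6.02 dB.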


Overflow Management

SHIFT

- Left shift removes the redundant sign bit after 2’s complement multiplication

- Right shift scales numbers down as word growth is detected

Unbiased rounding

- Prevents accumulation of a small dc bias from outputs which fall exactly half way between adjacent rounded values
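A short Python sketch of the dc bias, assuming that "unbiased rounding" means the common round-half-to-even scheme (Python's built-in `round()` behaves this way). The sample values are chosen to sit exactly half way between integers.

```python
import math


def round_half_up(x):
    # Conventional rounding: halfway values always go upward.
    return math.floor(x + 0.5)


def round_unbiased(x):
    # Python's built-in round() sends halfway values to the nearest even integer.
    return round(x)


halfway = [0.5, 1.5, 2.5, 3.5]   # outputs exactly between two rounded values
bias_half_up = sum(round_half_up(v) - v for v in halfway) / len(halfway)
bias_unbiased = sum(round_unbiased(v) - v for v in halfway) / len(halfway)
```

With round-half-up every halfway value errs in the same direction, so the average error is half a step; with round-half-to-even the errors alternate in sign and cancel.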

Saturation Logic

Sets the contents of the register to the maximum (or minimum) representable value if overflow occurs
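A minimal Python comparison of wraparound and saturation, assuming a 16-bit two's complement register (the width and function names are illustrative):

```python
INT16_MAX = 32767
INT16_MIN = -32768


def wrap16(value):
    """Two's complement wraparound: what a 16-bit register holds without saturation."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def saturate16(value):
    """Clamp to the largest or smallest 16-bit value instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, value))


total = 30000 + 10000          # true sum does not fit in 16 bits
wrapped = wrap16(total)        # sign flips: a large error
saturated = saturate16(total)  # pinned at full scale: a small error
```

Saturation does not recover the true sum, but the clipped value is far closer to it than the wrapped one, which changes sign.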

Block Floating Point

Scaling logic + exponent register: If an overflow condition is detected at any point, the entire array is rescaled downwards and the scaling is stored in the block exponent register.
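Block floating point can be sketched in Python as below. This is a simplified model under stated assumptions: a symmetric 16-bit limit, scaling in steps of one bit, and a hypothetical function name `block_rescale`.

```python
def block_rescale(block, limit=32767):
    """Shift the whole block right until every value fits within +/-limit.

    Returns the rescaled block and the block exponent (number of right shifts).
    """
    exponent = 0
    while any(abs(v) > limit for v in block):
        block = [v >> 1 for v in block]   # scale the entire array down by 2
        exponent += 1                     # remember the scaling
    return block, exponent


scaled, block_exp = block_rescale([1000, -70000, 40000, 12])
```

One value out of range forces the whole array down, so small values such as 12 lose resolution; that is the accuracy traded for sharing a single exponent.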


TOPIC 5: DSP PROGRAMMING : FIR Filter pseudo-code

Load loop count

Initialize coefficient and data addr regs

Zero Acc and P registers

LOOP: Pnew = A[i] * X[n-i]

Accnew = Accold + Pold

Decrement coefficient and data addr regs

X[n-i] -> X[n-i-1] {for next iteration}

Decr loop count

BNZ LOOP

Acc = Acc + P {add the last product}

Acc -> Y[n]
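The pseudo-code can be rendered in Python as a rough sketch. The function name `fir_output` and the test values are assumptions; the loop walks from the oldest sample to the newest, the product register lags the accumulator by one step, and the data move prepares the delay line for the next input.

```python
def fir_output(coeffs, delay_line):
    """Compute one output y[n]; delay_line[i] holds x[n-i] and is updated in place."""
    acc = 0.0   # accumulator
    p = 0.0     # product register
    # Walk from the oldest sample to the newest, as decrementing pointers would.
    for i in range(len(coeffs) - 1, -1, -1):
        acc += p                          # Acc_new = Acc_old + P_old
        p = coeffs[i] * delay_line[i]     # P_new = A[i] * X[n-i]
        if i + 1 < len(delay_line):
            delay_line[i + 1] = delay_line[i]   # X[n-i] -> X[n-i-1]
    acc += p                              # add the final product
    return acc


line = [3.0, 2.0, 1.0]                    # x[n], x[n-1], x[n-2]
y = fir_output([0.5, 0.25, 0.25], line)
```

After the call, `line[0]` is free to receive the next input sample, since every older sample has already been moved one place down.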

Addressing Modes

• Short immediate addressing mode

• Short direct addressing mode

• Memory mapped addressing mode

• Circular buffer addressing mode

• Bit-reversed addressing mode
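Bit-reversed addressing is typically used to reorder FFT data. A small Python sketch of the index sequence it generates, assuming an 8-point buffer (3 address bits); the function name is illustrative:

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit across
        index >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
```

A DSP with this addressing mode produces the same sequence in its address generator, so no separate reordering pass is needed.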


Different types of DSP architecture:

(i) Super scalar architecture

• Hardware is responsible for finding instruction-level parallelism (ILP) in a sequential program

• Advantage : Compatibility between generations

• Disadvantage : Very complex hardware


(ii) Explicitly Parallel Instruction Computing (EPIC)

• Combines VLIW and super scalar architectures

• Instructions are grouped into 3 operating blocks and a template block

• Template block tells hardware if instructions can be executed in parallel

• Also gives information whether the block can be executed in parallel


(iii) Instruction Level Processors

Increasing instructions / cycle

- requires fewer cycles to execute a task

- allows a longer clock period (lower clock rate) for the same performance

- allows a lower supply voltage

- and hence uses less power

However, too many functional units and too many transitions per clock cycle increase power consumption.

(iv) Low Power architecture and VLIW (Very Long Instruction Word) processors

Power consumed by additional circuits vs. ability to lower clock rate while maintaining performance

Circuits must be highly used

Move complexity into software

Voltage scaling : Reduce Vdd

Clock gating : Turn off clock when chip is not in use (applies to sub-modules of chip also)

VLIW is more suitable than super scalar for low power

- VLIW is smaller for same number of functional units

- Compiler is better at finding parallelism than hardware

Put multiple processors on chip rather than lots of functional units in one processor

Helps in running independent tasks


(vi) Improvement of Speed by Pipelining

• Processor speed can be enhanced by having separate hardware units for the different functional blocks, with buffers between the successive units.

– The number of unit operations into which the instruction cycle of a processor can be divided for this purpose defines the number of stages in the pipeline.

– A processor having an n-stage pipeline would have up to n instructions simultaneously being processed by the different functional units of the processor.

• Effective processor speed increases ideally by a factor equal to the number of pipelining stages.
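The "ideally" matters: the full factor is only approached for long instruction streams with no stalls. A small Python sketch, assuming one cycle per stage and a stall-free pipeline (function names are illustrative):

```python
def pipeline_cycles(instructions, stages):
    """Cycles to finish `instructions` on a stall-free pipeline of `stages` stages."""
    return stages + instructions - 1


def speedup(instructions, stages):
    # Without pipelining each instruction occupies the processor for `stages` cycles.
    return (instructions * stages) / pipeline_cycles(instructions, stages)


short_run = speedup(4, 4)      # pipeline fill time dominates
long_run = speedup(1000, 4)    # approaches the number of stages
```

For four instructions on a four-stage pipeline the gain is only 16/7; for a thousand instructions it is close to, but still below, 4.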


A Four-stage Pipeline


Data Dependency in Pipelining

If the input data for an instruction depends on the outcome of the previous instruction, the Write cycle of the previous instruction has to be over before the Operate cycle of the next instruction can start. The pipeline effectively idles through one instruction, creating a bubble in the pipeline which persists for several instructions.

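[Timing diagram, reconstructed from the recovered labels; F = Fetch, D = Decode, O = Operate, W = Write]

Cycle:  1    2    3    4     5    6    7    8
I1:     F1   D1   O1   W1
I2:          F2   D2   idle  O2   W2
I3:               F3   idle  D3   O3   W3
I4:                          F4   D4   O4   W4   (bubble ends here)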

Example of dependency

• A ← 3 + A; B ← 4 × A

These two cannot be performed in parallel, since the second needs the result of the first.

• Another case: A = B + A; B = A - B; A = A - B (swapping without temp); examine how you can handle this.

Branch Dependency in Pipelining

A Branch instruction can cause a pipeline stall if the branch is taken, as the next instruction has to be aborted in that case. If I1 is an unconditional branch instruction, the next Fetch cycle (F2) can start after D1. But if I1 is a conditional branch instruction, F2 has to wait until O1 for the decision as to whether the branch will be taken or not.

Cycle:                                    1    2    3    4    5    6    7
I1 (branch instruction):                  F1   D1   O1   W1
I2, executed if branch is not taken:           F2   D2   O2   W2
I2, executed for unconditional branch:              F2   D2   O2   W2
I2, for conditional branch, if taken:                    F2   D2   O2   W2

Causes for Pipeline Stalls

Memory block conflicts: If both instruction and data are to be fetched from the same block of memory, a stall is automatically inserted

DAG (data address generator) usage immediately (or within 2 cycles) after initialization. e.g.

I2 = 0x1234;

AX0 = DM(I2,M2);

Bus conflicts: Instructions which use the PMA/PMD buses for data transfer may cause bus conflict. e.g.

PM(I5,M7)=M3;

TOPIC 6 :

APPLICATION EXAMPLES

-Example 1 : TMS320C25


TMS320C25 KEY FEATURES

[Block diagram: 288 x 16 data RAM; 256 x 16 data/program; 4K x 16 program ROM; 32-bit ALU/ACC; 16 x 16 multiplier; 16-bit data bus; 16-bit address bus; interrupts; multiprocessor interface; serial interface; +5 V and GND]

100 ns INSTRUCTION CYCLE TIME

128K-WORDS TOTAL MEMORY SPACE

THREE PARALLEL SHIFTERS

133 GENERAL PURPOSE AND DSP INSTRUCTIONS

S/W UPWARD COMPATIBLE WITH PREVIOUS FAMILY MEMBERS

1.8 µm CMOS: 68-PIN PLCC / PGA

TMS320C25 GENERAL PURPOSE FEATURES

[Flowchart: X = X - Y; test BIT 16 = 0 (YES / NO); OUTPUT X]

COMPREHENSIVE INSTRUCTION SET - 133 INSTRUCTIONS INCLUDING

- NUMERICAL (34)

- LOGICAL (15)

- MEMORY MANAGEMENT (33)

- BRANCHES (20)

- PROGRAM/MODE CONTROL (31)

EXTENDED-PRECISION ARITHMETIC

SERIAL PORT (DOUBLE BUFFERED, STATIC)

MULTIPROCESSOR INTERFACES (CONCURRENT DMA, GLOBAL DATA MEMORY)

BLOCK MOVES (UP TO 10 M WORDS/SEC)

ON-CHIP TIMER

THREE EXTERNAL MASKABLE INTERRUPTS

POWERDOWN MODE

TMS320C25 ALU

DESIGN & OPERATION

32-BIT ALU & ACCUMULATOR

CARRY BIT FOR EXTENDED PRECISION

OVERFLOW DETECTION & SATURATION

SIGN EXTENSION OPTION

0-16 BIT PARALLEL SHIFTER FOR LOADS AND ARITHMETIC OPS

SHIFTERS ON PRODUCT REGISTER OUTPUT DATA

0-7 BIT PARALLEL SHIFTER FOR ACCUMULATOR STORES

TMS320C25 - MULTIPLY INSTRUCTIONS II

MAC : multiply data memory * program memory and add previous P-Reg to ACC

MACD : multiply data memory * program memory, add previous P-Reg to ACC, and move data memory

SQRA : square data memory value and add previous P-Reg to ACC

SQRS : square data memory value and subtract previous P-Reg from ACC

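TMS320C25 - HIGHER PERFORMANCE AT LESS CODE SPACE

[Figure: tapped delay line. The input x_n passes through a chain of Z^{-1} delays; each tap is multiplied (x) by its coefficient and the products are summed to give Y_n.]

$$Y_n = \sum_{K=0}^{N} b_K \, X(n-K)$$

TMS320C25:

RPTK 49

MACD

3 WORDS PROG MEMORY, 53 CYCLES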

TMS320C25 ADDRESSING MODES

IMMEDIATE ADDRESSING

- BOTH LONG AND SHORT CONSTANTS

- EXAMPLES: ADDK 5 and ADLK >1325

- [Figure: program memory holding ADDK 5, and ADLK followed by the constant 1325]

DIRECT ADDRESSING

- SAME AS TMS320C1X BUT DP IS 9 BITS

- 512 “BANKS” OF 128 WORDS

- USED OFTEN FOR LONG SEQUENCES OF IN-LINE CODE

- [Figure: operand address = 9 bits from DP followed by 7 bits from the instruction]

INDIRECT ADDRESSING

- 8 AUXILIARY REGISTERS

- USED OFTEN IN PROGRAM LOOPS WITH AUTO INC/DEC OPTIONS
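How the 9-bit data page pointer (DP) and the 7 bits taken from the instruction combine into a 16-bit operand address can be sketched in Python. The function name and the example values are illustrative assumptions, not TMS320C25 code.

```python
def direct_address(dp, offset):
    """Form a 16-bit data address from a 9-bit page pointer and a 7-bit offset."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in 7 bits")
    return (dp << 7) | offset   # DP forms the upper bits, the offset the lower 7


addr = direct_address(4, 0x05)      # page 4, word 5
last = direct_address(511, 127)     # last word of the last page
```

The 512 pages of 128 words together cover the full 64K-word data address range.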

Addressing Mode (contd.)

• Circular buffer addressing mode

• Bit-reversed addressing mode


BLOCK DIAGRAM OF A TMS320C5X DSP


General-Purpose Microprocessor

Example 2 : circa 1984 : Intel 8088

~100,000 transistors

Clock speed : ~ 5 MHz

Address space : 20 bits

Bus width : 8 bits

100+ instructions

2-35 cycles per instruction

Micro-coded architecture

Example 3 :DSP TMS 32010 1984

Clock 20 MHz

16 bits

8, 12 bits addressing space

~ 50 k transistors

~ 35 instructions

Harvard architecture

Hardware multiplier

Double length accumulator with saturation

A few special DSP instructions

Relatively inexpensive

Example 4 :General Purpose Microprocessor 2000

GHz clock speed

32-bit address or more

32-bit bus, 128-bit instructions

Complex MMU

Super scalar CPU

MMX instructions

On chip cache

Single cycle execution

32-bit floating point ALU on board

Very expensive

10s of watts of power

DSP in 2000

Clock 100 ~ 200 MHz

16-bit fixed point or 32-bit floating point

16-24 bits address space

Large on-chip and off-chip memories

Single cycle execution of most instructions

Harvard architecture

Lots of special DSP instructions

50 mW to 2 W power

Cheap

Example 5 :Future of DSP Microprocessor

Sufficiently unique for an independent class of applications (HDD, cell phone)

Low power consumption, low cost

High performance within power, cost constraints (MIPS/mW, MIPS/$)

Fixed point & floating point

Better compilers - but users must be informed

Hybrid DSP/ GP systems
