DSP APPLICATIONS
UNIT 5 : DISCRETE TIME SIGNAL PROCESSING
A Digital Signal Processing System
[Block diagram: Analog signal in → Antialiasing filter → Sample and hold → A/D → DSP → D/A → Reconstruction filter → Analog signal out]
9/28/2019 2
A perspective of the Digital Signal Processing problem
[Figure: layered view, top to bottom]
Application areas: Radar, Speech, Seismic, Image, Medical, ...
Digital signal processing theory
Basic functions
Processor instruction sets and/or hardware functions
Component technology
Side labels: Theoretical problem modelling, Algorithms, Architectures, Implementation
TOPIC 1 : DSP FUNCTIONALITIES
APPLICATION REQUIREMENTS | PROCESSOR ATTRIBUTES
REAL-TIME PROCESSING | HIGH SPEED, HIGH THROUGHPUT
LARGE ARRAY OF DATA | INSTRUCTIONS TO MOVE AND PROCESS LARGE DATA ARRAYS
ALGORITHM INTENSIVE | FAST MATHEMATICAL COMPUTATIONS, SINGLE CYCLE DSP OPERATIONS (MACD)
SYSTEM FLEXIBILITY | GENERAL PURPOSE PROGRAMMABILITY, EPROM, MC/MP MODE
Functions of DSP Processors
Caters to high arithmetic demands
Real time operation
Analog input / output
Large number of functional units for a given size
Small control logic
Different approaches to hardware implementation
1. HIGH SPEED GENERAL PURPOSE COMPUTERS
Advantages: Programmable; can be configured for different applications
Disadvantages: Expensive; complex control; I/O overheads
2. CUSTOM-DESIGNED VLSI COMPONENTS
Advantages: Efficient design; large throughputs
Disadvantages: Application specific; high development cost
3. GENERAL PURPOSE DIGITAL SIGNAL PROCESSORS
Combine the Programmability & Control features of general
purpose computers and the Architectural innovations of
special purpose chips.
GOALS: HIGH SPEED, LOW POWER AND LOW COST
2 marks : Why are conventional processors not suitable for DSP?
Caches are a waste of chip area
Small register files force lots of memory accesses
- register files are different from cache since they are program managed
Complex instruction issue logic, branch prediction, speculation etc. are not needed for DSP
Not enough ALU functions
Not adequate dynamic range and precision
Data Processing vs Signal Processing
• General-purpose microprocessors are designed primarily for Data Processing.
– The primary burden is Data Read/Write
• Digital Signal Processors are microprocessors specifically designed for Signal Processing.
– The primary burden is Mathematical operation
• DSP architecture therefore incorporates certain features not found in general-purpose µPs.
TOPIC 2 : CIRCULAR BUFFERING
• A critical element of Digital Signal Processors.
• Definition : Circular buffering is an addressing capability of DSP processors in which the address pointer wraps around to the start of the buffer on reaching its end, so that a new sample overwrites the oldest one without any data being moved. This significantly accelerates data transfer in a real-time system.
• Main goal : to accelerate the calculations while keeping the power consumption as low as possible.
• Example : An FIR filter shows the typical properties of various DSP algorithms.
Example : Performing FIR filtering on a real time input
• Assume that we have a four-tap filter with the following difference equation: y(n) = b0 x(n) + b1 x(n-1) + b2 x(n-2) + b3 x(n-3).
• Each input sample requires the following steps, which together make up the Multiply and Accumulate (MAC) cycle.
MAC Execution Cycle
• Obtain a sample of the Input signal
• Move the input sample into the input buffer
• Fetch the co-efficient from internal buffer
• Multiply the input sample by the co-efficient
• Add the product to the Accumulator
• Move the output to the output buffer
• Send it out as a sample of the output signal
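The steps above can be sketched in a few lines of Python. This is only an illustration of the MAC cycle for the four-tap filter, not processor code; the function name `fir_step` and the coefficient values are assumptions chosen for the example.

```python
def fir_step(coeffs, history, sample):
    """Run one MAC cycle: take a new input sample, return one output sample."""
    # Move the input sample into the input buffer (newest sample first).
    history.insert(0, sample)
    history.pop()                  # discard the oldest sample
    acc = 0.0                      # clear the accumulator
    for b, x in zip(coeffs, history):
        acc += b * x               # multiply by the coefficient and accumulate
    return acc                     # value moved to the output buffer

# Hypothetical coefficients b0..b3, chosen only for illustration.
b = [0.5, 0.25, 0.125, 0.125]
hist = [0.0] * 4
# Feeding a unit impulse returns the coefficients themselves, then zero.
out = [fir_step(b, hist, x) for x in [1.0, 0.0, 0.0, 0.0, 0.0]]
```

Here the buffer is shifted on every sample, which is exactly the data movement that circular buffering is meant to avoid.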
MAC Execution Hardware
[Figure: the Program Counter addresses the Program/Coefficient Memory; the Data Read Address and Data Write Address access the Data Memory; the CPU contains a multiplier (*), an adder (+) and an accumulator (ACC)]
Circular address allocation of the following example
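A minimal sketch of circular addressing, assuming a four-word buffer to match the four-tap example. The class name `CircularBuffer` and its methods are hypothetical; on a DSP the wrap-around is done by the address generator in hardware rather than by a modulo operation in software.

```python
class CircularBuffer:
    """Fixed-size sample store whose pointer wraps around at the end."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.newest = 0            # index of the most recent sample

    def push(self, sample):
        # Advance the pointer with wrap-around; the oldest sample is overwritten.
        self.newest = (self.newest + 1) % len(self.data)
        self.data[self.newest] = sample

    def delayed(self, k):
        # Return x(n-k), stepping backwards from the newest sample with wrap-around.
        return self.data[(self.newest - k) % len(self.data)]

buf = CircularBuffer(4)
for s in [1.0, 2.0, 3.0, 4.0, 5.0]:
    buf.push(s)
# x(n), x(n-1), x(n-2), x(n-3) after five samples have arrived
taps = [buf.delayed(k) for k in range(4)]
```

No sample is ever copied from one location to another; only the pointer moves.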
TOPIC 3 : DSP ARCHITECTURE
• General-purpose processors are based on the Von Neumann architecture (a single memory bank, which the processor accesses through a single set of address and data lines)
• Harvard architecture is commonly used in DSP processors
– Separate Data and Program memories (two memory banks)
– Separate Address Buses for Data and Program memories
Features of a DSP architecture
• Instruction Cache and Pipelined processor as in any modern microprocessor, but no Data Cache
• Separate ALU, Multiplier and Shifter, connected through multiple internal data buses, enabling fast MAC operations
DSP, CISC and RISC
• DSP processors cannot strictly be classified as either CISC or RISC processors
• Some features of a RISC processor may be present. However, DSP processors are “tuned” towards operations encountered in signal processing applications
DSP ARCHITECTURE IMPLEMENTATION
Important desirable characteristics
Adequate word length
Fast multiply & accumulate
High speed RAM
Fast Coefficient table addressing
Fast new sample fetch mechanism
GENERAL PURPOSE DSP FEATURES
1. PARALLELISM: Multiple Functional Units, Multiple Buses, Multiple Memories
2. PIPELINING
3. HARDWARE MULTIPLIERS AND OTHER ARITHMETIC FUNCTIONS
4. ON-CHIP AND CACHE MEMORIES
5. A VARIETY OF ADDRESSING MODES
7. INSTRUCTIONS THAT PACK SEVERAL OPERATIONS
8. ZERO-OVERHEAD LOOPING
9. I/O FEATURES SUCH AS INTERRUPT, SERIAL I/O, DMA
10. OTHER CONTROL FUNCTIONS SUCH AS WAIT STATES
A Typical DSP Architecture
[Block diagram: Program Memory (PM, instructions and secondary data) and Data Memory (DM, data only), each with its own Data Address Generator and separate address and data buses (PM Address, PM Data, DM Address, DM Data); a Program Sequencer with Instruction Cache; Registers serving the Multiplier, ALU and Shifter; an I/O Controller (DMA) connected to Input/Output over a DMA Bus]
SHIFTERS
- Scale numbers to prevent overflow/underflow
- Conversion between fixed point and floating point
- Many bits must be shifted in a single cycle to preserve single cycle computational speed (Barrel Shifter)
- Logical shift assumes unsigned data and fills with zeroes, left or right
- Arithmetic shift scales numbers upwards (left, zero fill) or downwards (right, sign extend)
- Normalization/de-normalization for block floating point
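The difference between logical and arithmetic shifts can be illustrated on 16-bit words. This is a sketch only: the 16-bit word length and the helper names are assumptions, and Python integers are masked to imitate a fixed register width.

```python
MASK16 = 0xFFFF  # assumed 16-bit word length

def logical_shift_right(word, n):
    # Treat the word as unsigned: zeroes are shifted in from the left.
    return (word & MASK16) >> n

def arithmetic_shift_right(word, n):
    # Treat the word as two's complement: the sign bit is replicated.
    word &= MASK16
    if word & 0x8000:
        word -= 0x10000            # reinterpret as a negative value
    return (word >> n) & MASK16

def shift_left(word, n):
    # Left shifts fill with zeroes; bits shifted out of the word are lost.
    return (word << n) & MASK16

# 0xF000 is 61440 when unsigned, or -4096 in two's complement.
logical = logical_shift_right(0xF000, 4)        # zero fill
arithmetic = arithmetic_shift_right(0xF000, 4)  # sign extend
```

The logical result is 0x0F00 and the arithmetic result is 0xFF00, that is, -256, which is -4096 scaled down by 16.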
Memory
Traditional µPs : register to register (limited memory bandwidth)
DSPs : memory to memory (higher memory bandwidth)
Up to six memory fetches in an instruction cycle
Parallel memory banks: small, fast and simple memories
Internal vs External
Pin-count limitation; speed penalty; off-chip bussing
Internal busses are multiplexed to the outside. Expand only one memory off-chip.
MEMORY ORGANISATION - I
[Figure: BASIC HARVARD ARCHITECTURE: program memory and data memory. MODIFICATION #1: program/data memory and data memory. MODIFICATION #2: program memory and multi-port data memory.]
MEMORY ORGANISATION - II
[Figure: MODIFICATION #3: program cache, program/data memory and data memory. MODIFICATION #4: program memory and two data memories. MODIFICATION #5: multiple program memories and data memories, with I/O.]
Example : DSP architecture of a second order FIR filter
[Figure: x(n) passes through two delays, giving x(n-1) and x(n-2); the three signals are weighted by h(0), h(1) and h(2) and summed by two adders to give y(n)]
y(n) = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2)
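Organization of signal samples and filter coefficients for a second order FIR filter implementation
[Figure: samples x(n+1), x(n), x(n-1), x(n-2) linked by delays, and coefficients h(0), h(1), h(2), addressed through auxiliary registers ar1 and ar2; both feed the MAC unit, which produces y(n)]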
An Nth order FIR filter implementation
[Figure: Coefficient Memory holds A[0], A[1], A[2], ..., A[N-1]; Data Memory holds X[n], X[n-1], X[n-2], ..., X[n-N+1]; the multiplier (*) writes the product register P, and the adder (+) accumulates P into ACC to give y[n]]
Salient Features
• REPEAT-MAC instruction
- Performs auto-increment of both coefficient and data pointers
- Frees up program memory bus for fetching coefficients
• Circular buffer
- to manage data movement at the end of every output computation
• Handling precision
- Accumulator guard bits
- Saturation mode
- Shifters (both right and left shift)
TOPIC 4 : FIXED AND FLOATING POINT ARCHITECTURE PRINCIPLES
Fixed point
- Array indices, loop counters etc.
- Overflow/underflow management
- Error budget for word length growth
Floating point
- Wider dynamic range, frees user from scaling concerns
- Less sensitive to error accumulation
- 50% slower for same technology
- Higher cost
- Normalize after each operation
- Mantissa round off (some accuracy is traded)
Fixed point does not always limit performance: e.g., for a dynamic range of 50 to 60 dB, 12-bit quantization (step size of -72 dB) is more than adequate. For Hi-fi audio with 80 dB dynamic range, 16 bits (-96 dB) are more than adequate.
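The -72 dB and -96 dB figures follow from the ratio between full scale and one quantization step, 20 log10(2^bits), which is about 6.02 dB per bit. A short check, with a function name chosen here only for illustration:

```python
import math

def quantization_range_db(bits):
    """Ratio of full scale to one quantization step, in dB."""
    return 20 * math.log10(2 ** bits)

range_12 = quantization_range_db(12)   # about 72 dB
range_16 = quantization_range_db(16)   # about 96 dB
per_bit = quantization_range_db(1)     # about 6.02 dB for each extra bit
```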
Overflow Management
SHIFT
Left shift removes the redundant sign bit after 2’s complement multiplication
Right shift scales numbers down as word growth is detected
Unbiased rounding
Prevents accumulation of a small dc bias from outputs which fall exactly half way between adjacent rounded values
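The dc bias can be seen by rounding values that lie exactly half way between two levels. The sketch below drops the lowest n bits of an integer in two ways; round-half-to-even is used here as one common form of unbiased rounding, which is an assumption since the slide does not name a specific scheme.

```python
def round_half_up(x, n):
    # Add half an LSB, then truncate: ties always go upwards.
    return (x + (1 << (n - 1))) >> n

def round_half_even(x, n):
    # Ties go to the nearest even result, so they go up and down equally often.
    half = 1 << (n - 1)
    q, r = x >> n, x & ((1 << n) - 1)
    if r > half or (r == half and (q & 1)):
        q += 1
    return q

# With n = 2 these integers represent 0.5, 1.5, 2.5 and 3.5 (all exact ties).
ties = [2, 6, 10, 14]
true_sum = sum(ties) / 4
bias_up = sum(round_half_up(x, 2) for x in ties) - true_sum
bias_even = sum(round_half_even(x, 2) for x in ties) - true_sum
```

Rounding ties upwards accumulates an error of half an LSB per tie, while the unbiased scheme leaves no net error over these four values.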
Saturation Logic
Sets the register contents to the maximum (or most negative) representable value if overflow occurs
Block Floating Point
Scaling logic + exponent register: If an overflow condition is detected at any point, the entire array is rescaled downwards and the scaling is stored in the block exponent register.
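Both mechanisms can be sketched for 16-bit data. The word length, the function names and the choice of rescaling by one bit at a time are assumptions made for illustration.

```python
INT16_MAX, INT16_MIN = 32767, -32768   # assumed 16-bit two's complement range

def saturate16(value):
    # Clamp to the largest or most negative value instead of wrapping around.
    return max(INT16_MIN, min(INT16_MAX, value))

def block_rescale(block, exponent=0):
    # Scale the whole block down until every value fits in 16 bits,
    # recording the number of shifts in the block exponent.
    while any(v > INT16_MAX or v < INT16_MIN for v in block):
        block = [v >> 1 for v in block]
        exponent += 1
    return block, exponent

clipped_high = saturate16(30000 + 10000)
clipped_low = saturate16(-40000)
scaled, block_exp = block_rescale([40000, -100, 8])
```

One out-of-range value forces the whole block to be halved, so small values in the same block lose one bit of resolution.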
TOPIC 5: DSP PROGRAMMING : FIR Filter pseudo-code
Load loop count
Initialize coefficient and data addr regs
Zero Acc and P registers
LOOP: Pnew = A[i] * X[n-i]
Accnew = Accold + Pold
Decrement coefficient and data addr regs
X[n-i] → X[n-i-1] {for next iteration}
Decr loop count
BNZ LOOP
Acc → Y[n]
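The pseudo-code can be mirrored in Python as a rough model, with several assumptions: the function name `fir_macd` is hypothetical, the data buffer has one spare word to receive the oldest sample when it is moved, and one extra accumulate is added after the loop because the product register always lags the accumulator by one step.

```python
def fir_macd(coeffs, x, sample):
    """One output sample of an FIR filter, following the pseudo-code steps.

    x must have len(coeffs) + 1 words; x[k] holds X[n-k].
    """
    n_taps = len(coeffs)
    x[0] = sample              # newest sample X[n]
    count = n_taps             # load loop count
    i = n_taps - 1             # address registers start at the oldest sample
    acc, p = 0.0, 0.0          # zero Acc and P registers
    while count != 0:          # BNZ LOOP
        acc += p               # Acc_new = Acc_old + P_old
        p = coeffs[i] * x[i]   # P_new = A[i] * X[n-i]
        x[i + 1] = x[i]        # X[n-i] -> X[n-i-1], ready for the next sample
        i -= 1                 # decrement coefficient and data addr regs
        count -= 1             # decrement loop count
    acc += p                   # assumed final step: add the last product
    return acc                 # Acc -> Y[n]

a = [1.0, 2.0, 3.0]            # hypothetical coefficients A[0..2]
line = [0.0] * (len(a) + 1)
y = [fir_macd(a, line, s) for s in [1.0, 0.0, 0.0, 0.0]]
```

Working from the oldest sample towards the newest lets each word be moved one place along only after it has been used.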
Addressing Modes
• Short immediate addressing mode
• Short direct addressing mode
• Memory mapped addressing mode
• Circular buffer addressing mode
• Bit-reversed addressing mode
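Bit-reversed addressing generates indices whose address bits are taken in reverse order, which is the ordering needed by radix-2 FFT routines. A small sketch, with the function name chosen only for illustration; on a DSP the address generator produces this sequence in hardware.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit across
        index >>= 1
    return result

# Access order for an 8-point (3-bit) buffer
order = [bit_reverse(i, 3) for i in range(8)]
```

For eight points the sequence is 0, 4, 2, 6, 1, 5, 3, 7.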
Different types of DSP architecture:
(i) Super scalar architecture
• Hardware responsible for finding ILP in a sequential program
• Advantage : Compatibility between generations
• Disadvantage : Very complex hardware
(ii) Explicitly Parallel Instruction Computing (EPIC)
• Combines VLIW and super scalar architectures
• Instructions are grouped into 3 operating blocks and a template block
• Template block tells hardware if instructions can be executed in parallel
• Also gives information whether the block can be executed in parallel
(iii) Instruction Level Processors
Increases the number of instructions per cycle
Requires fewer cycles to execute a task
Can use a longer clock period (lower clock rate) for the same performance
Can use a lower supply voltage
And hence uses less power
However, too many functional units and too many transitions per clock cycle increase power consumption.
(iv) Low Power architecture and VLIW (Very Long Instruction Word) processors
Power consumed by additional circuits vs. ability to lower clock rate while maintaining performance
Circuits must be highly used
Move complexity into software
Voltage scaling : Reduce Vdd
Clock gating : Turn off clock when chip is not in use (applies to sub-modules of chip also)
VLIW is more suitable than super scalar for low power
- VLIW is smaller for same number of functional units
- Compiler is better at finding parallelism than hardware
Put multiple processors on chip rather than lots of functional units in one processor
Helps in running independent tasks
(vi) Improvement of Speed by Pipelining
• Processor speed can be enhanced by having separate hardware units for the different functional blocks, with buffers between the successive units.
– The number of unit operations into which the instruction cycle of a processor can be divided for this purpose defines the number of stages in the pipeline.
– A processor having an n-stage pipeline would have up to n instructions simultaneously being processed by the different functional units of the processor.
• Effective processor speed increases ideally by a factor equal to the number of pipelining stages.
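The ideal speed-up can be checked with a simple cycle count, assuming one cycle per stage and no stalls. The function names below are chosen for illustration only.

```python
def pipeline_cycles(n_stages, n_instr):
    # The first instruction needs n_stages cycles; each later one finishes
    # one cycle after its predecessor.
    return n_stages + n_instr - 1

def speedup(n_stages, n_instr):
    # Without pipelining every instruction takes n_stages cycles on its own.
    return (n_stages * n_instr) / pipeline_cycles(n_stages, n_instr)

short_run = speedup(4, 4)       # 16 cycles reduced to 7
long_run = speedup(4, 1000)     # approaches 4 for long instruction streams
```

The factor only approaches the number of stages for long instruction streams, and any stall lowers it further.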
A Four-stage Pipeline
Data Dependency in Pipelining
If the input data for an instruction depends on the outcome of the previous instruction, the Write cycle of the previous instruction has to be over before the Operate cycle of the next instruction can start. The pipeline effectively idles through one instruction, creating a bubble in the pipeline which persists for several instructions.
(F = Fetch, D = Decode, O = Operate, W = Write)
I1: F1 D1 O1 W1
I2: F2 D2 idle O2 W2
I3: F3 idle D3 O3 W3
I4: F4 D4 O4 W4 (bubble ends here)
Example of dependency
• A ← 3 + A; B ← 4 × A
These two cannot be performed in parallel, since the second needs the new value of A
• Another case: A = B + A; B = A – B; A = A – B (swapping without temp); examine how you can handle this.
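The swap sequence can be traced with concrete numbers (the starting values are arbitrary). Each statement reads a value written by the one before it, so the three operations form a dependency chain and must complete in order.

```python
a, b = 7, 3
a = b + a    # a = 10: needs the original a and b
b = a - b    # b = 7:  needs the a just written above
a = a - b    # a = 3:  needs the b just written above
swapped = (a, b)
```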
Branch Dependency in Pipelining
A Branch instruction can cause a pipeline stall if the branch is taken, as the next instruction has to be aborted in that case. If I1 is an unconditional branch instruction, the next Fetch cycle (F2) can start after D1. But if I1 is a conditional branch instruction, F2 has to wait until O1 for the decision as to whether the branch will be taken or not.
F1 D1 O1 W1 : branch instruction
F2 D2 O2 W2 : executed if branch is not taken
F2 D2 O2 W2 : executed for unconditional branch (F2 starts after D1)
F2 D2 O2 W2 : for conditional branch, if taken (F2 starts after O1)
Causes for Pipeline Stalls
Memory block conflicts: If both instruction and data are to be fetched from the same block of memory, a stall is automatically inserted
DAG usage immediately (or within 2 cycles) after initialization, e.g.
I2 = 0x1234;
AX0 = DM(I2,M2);
Bus conflicts: Instructions which use the PMA/PMD buses for data transfer may cause bus conflict, e.g.
PM(I5,M7)=M3;
TOPIC 6 :
APPLICATION EXAMPLES
-Example 1 : TMS320C25
TMS320C25 KEY FEATURES
[Block diagram: 288 x 16 data RAM, 256 x 16 data/program RAM, 4K x 16 program ROM, 32-bit ALU/ACC, 16 x 16 multiplier, 16-bit data bus, 16-bit address bus, interrupts, multiprocessor interface, serial interface, +5 V and GND]
100 ns INSTRUCTION CYCLE TIME
128K-WORDS TOTAL MEMORY SPACE
THREE PARALLEL SHIFTERS
133 GENERAL PURPOSE AND DSP INSTRUCTIONS
S/W UPWARD COMPATIBLE WITH PREVIOUS FAMILY MEMBERS
1.8 µm CMOS: 68-PIN PLCC / PGA
TMS320C25 GENERAL PURPOSE FEATURES
[Flowchart: X = X - Y; test BIT 16 = 0 (YES / NO); OUTPUT X]
COMPREHENSIVE INSTRUCTION SET - 133 INSTRUCTIONS INCLUDING
- NUMERICAL (34)
- LOGICAL (15)
- MEMORY MANAGEMENT (33)
- BRANCHES (20)
- PROGRAM/MODE CONTROL (31)
EXTENDED-PRECISION ARITHMETIC
SERIAL PORT (DOUBLE BUFFERED, STATIC)
MULTIPROCESSOR INTERFACES (CONCURRENT DMA, GLOBAL DATA MEMORY)
BLOCK MOVES (UP TO 10 M WORDS/SEC)
ON-CHIP TIMER
THREE EXTERNAL MASKABLE INTERRUPTS
POWERDOWN MODE
TMS320C25 ALU
DESIGN & OPERATION
32-BIT ALU & ACCUMULATOR
CARRY BIT FOR EXTENDED PRECISION
OVERFLOW DETECTION & SATURATION
SIGN EXTENSION OPTION
0-16 BIT PARALLEL SHIFTER FOR LOADS AND ARITHMETIC OPS
SHIFTERS ON PRODUCT REGISTER OUTPUT DATA
0-7 BIT PARALLEL SHIFTER FOR ACCUMULATOR STORES
TMS320C25 - MULTIPLY INSTRUCTIONS II
MAC : MPY data memory * program memory & add past P-Reg to ACC
MACD : MPY data memory * program memory, add past P-Reg to ACC, & move data memory
SQRA : Square data memory value & add past P-Reg to ACC
SQRS : Square data memory value & sub past P-Reg from ACC
TMS320C25 - HIGHER PERFORMANCE AT LESS CODE SPACE
[Figure: FIR filter with input xn, a chain of Z-1 delays, one multiplier (x) per tap, and output Yn]
$Y_n = \sum_{K=0}^{N} b_K \, X(n-K)$
TMS320C25:
RPTK 49
MACD
3 WORDS PROG MEMORY, 53 CYCLES
TMS320C25 ADDRESSING MODES
IMMEDIATE ADDRESSING
- BOTH LONG AND SHORT CONSTANTS
- EXAMPLES:
ADDK 5
ADLK >1325
[Figure: program memory holding ADDK 5, and ADLK followed by the constant 1325]
DIRECT ADDRESSING
- SAME AS TMS320C1X BUT DP IS 9 BITS
- 512 “BANKS” OF 128 WORDS
- USED OFTEN FOR LONG SEQUENCES OF IN-LINE CODE
[Figure: operand address = 9 bits from DP followed by 7 bits from the instruction]
INDIRECT ADDRESSING
- 8 AUXILIARY REGISTERS
- USED OFTEN IN PROGRAM LOOPS WITH AUTO INC/DEC OPTIONS
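The direct addressing scheme described above (a 9-bit data page pointer joined to a 7-bit field from the instruction) can be sketched as follows. The function name and the range checks are illustrative assumptions, not part of any TI tool.

```python
def direct_address(dp, offset):
    """Form a 16-bit operand address from a 9-bit DP and a 7-bit offset."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in 7 bits")
    return (dp << 7) | offset   # DP selects the bank, offset the word within it

first_word_of_bank_4 = direct_address(4, 0)
last_address = direct_address(511, 127)
```

512 banks of 128 words cover 65,536 addresses, so the highest address is 65535.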
Addressing Mode (contd.)
• Circular buffer addressing mode
• Bit-reversed addressing mode
BLOCK DIAGRAM OF A TMS320C5X DSP
Example 2 : General-Purpose Microprocessor circa 1984 : Intel 8088
~29,000 transistors
Clock speed : ~ 5 MHz
Address space : 20 bits
Bus width : 8 bits
100+ instructions
2-35 cycles per instruction
Micro-coded architecture
Example 3 :DSP TMS 32010 1984
Clock 20 MHz
16 bits
8, 12 bits addressing space
~ 50 k transistors
~ 35 instructions
Harvard architecture
Hardware multiplier
Double length accumulator with saturation
A few special DSP instructions
Relatively inexpensive
Example 4 :General Purpose Microprocessor 2000
GHz clock speed
32-bit address or more
32-bit bus, 128-bit instructions
Complex MMU
Super scalar CPU
MMX instructions
On chip cache
Single cycle execution
32-bit floating point ALU on board
Very expensive
10s of watts of power
DSP in 2000
Clock 100 ~ 200 MHz
16-bit fixed point or 32-bit floating point
16-24 bits address space
Large on-chip and off-chip memories
Single cycle execution of most instructions
Harvard architecture
Lots of special DSP instructions
50 mW to 2 W power
Cheap
Example 5 : Future of DSP Microprocessor
Sufficiently unique for an independent class of applications (HDD, cell phone)
Low power consumption, low cost
High performance within power, cost constraints (MIPS/mW, MIPS/$)
Fixed point & floating point
Better compilers - but users must be informed
Hybrid DSP/GP systems