Introduction to Digital Signal Processors (DSPs)

Page 1: Introduction to Digital Signal Processors (DSPs) · OR a6,a3,a2 ;a2 = a6 OR a3. Title: PowerPoint Presentation Author: sayeekumar Created Date: 12/27/2019 9:51:42 PM ...

Introduction to Digital Signal Processors (DSPs)

Outline/objectives

• Identify the most important DSP processor architecture features and how they relate to DSP applications

• Understand the types of code appropriate for DSP implementation

What is a DSP?

• A specialized microprocessor for real-time digital signal processing applications

– Digital filtering (FIR and IIR)

– FFT

– Convolution, matrix multiplication, etc.

[Block diagram: ANALOG INPUT → ADC → DIGITAL INPUT → DSP → DIGITAL OUTPUT → DAC → ANALOG OUTPUT]

Hardware used in DSP

                     ASIC        FPGA      GPP       DSP
Performance          Very high   High      Medium    Medium-high
Flexibility          Very low    High      High      High
Power consumption    Very low    Low       Medium    Low-medium
Development time     Long        Medium    Short     Short

Common DSP features

• Harvard architecture

• Dedicated single-cycle Multiply-Accumulate (MAC) instruction (hardware MAC units)

• Single Instruction, Multiple Data (SIMD) and Very Long Instruction Word (VLIW) architectures

• Pipelining

• Saturation arithmetic

• Zero overhead looping

• Hardware circular addressing

• Cache

• DMA

Harvard Architecture

• Physically separate memories and paths for instructions and data

[Diagram: a CPU connected by separate paths to a DATA MEMORY and a PROGRAM MEMORY]

Single-Cycle MAC unit

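[Diagram: a multiplier forms each product $a_i x_i$; an adder adds it to the sum of the earlier products ($a_{i-1} x_{i-1}$, ...) held in a register, so the register accumulates $\sum_{i=0}^{n} a_i x_i$]

Can compute a sum of n products in n cycles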
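
The accumulation performed by a MAC unit can be sketched in C. This is a minimal illustration rather than vendor code: the function name `mac_sum`, the 16-bit inputs, and the 64-bit accumulator are assumptions made for the example. Each loop iteration corresponds to one multiply-accumulate step, which is the operation a single-cycle hardware MAC is meant to perform.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of products a[0]*x[0] + ... + a[n-1]*x[n-1].
 * Each iteration is one multiply-accumulate: form the product, then
 * add it to the running sum held in the accumulator. */
int64_t mac_sum(const int16_t *a, const int16_t *x, size_t n)
{
    int64_t acc = 0;                              /* accumulator register */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)x[i];     /* one MAC per iteration */
    }
    return acc;
}
```

The accumulator is wider than the products so that many terms can be summed without overflow, a common design choice in MAC datapaths.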

Single Instruction, Multiple Data (SIMD)

• A technique for data-level parallelism by employing a number of processing elements working in parallel
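
The idea of one operation acting on several data elements at once can be emulated in portable C. The sketch below is an assumption-laden illustration, not a real SIMD instruction: the function name `add2_packed` and the choice of two 16-bit lanes packed into a 32-bit word are hypothetical. A real SIMD unit would perform both lane additions with a single instruction; this version only reproduces the result.

```c
#include <stdint.h>

/* Add two pairs of 16-bit lanes packed into 32-bit words.
 * Each lane wraps independently: no carry crosses from the low lane
 * into the high lane. */
uint32_t add2_packed(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;                   /* low lanes  */
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;   /* high lanes */
    return (hi << 16) | lo;
}
```

For example, `add2_packed(0x00010002, 0x00030004)` yields `0x00040006`: the lanes 1+3 and 2+4 are computed side by side.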

Very Long Instruction Word (VLIW)

• A technique for instruction-level parallelism by executing instructions without dependencies (known at compile-time) in parallel

• Example of a single VLIW instruction:

F=a+b; c=e/g; d=x&y; w=z*h;

[Diagram: the VLIW instruction F=a+b | c=e/g | d=x&y | w=z*h is issued to four processing units (PU) in parallel; the PUs take the operand pairs (a, b), (e, g), (x, y), (z, h) and produce F, c, d, w]

CISC vs. RISC vs. VLIW

Pipelining

• DSPs commonly feature deep pipelines

• TMS320C6x processors have 3 pipeline stages, each with a number of phases (cycles):

  – Fetch
    • Program Address Generate (PG)
    • Program Address Send (PS)
    • Program Ready Wait (PW)
    • Program Receive (PR)

  – Decode
    • Dispatch (DP)
    • Decode (DC)

  – Execute
    • 6 to 10 phases

Saturation Arithmetic

• Operations like addition and multiplication are confined to a fixed range

• Results that would normally overflow or underflow are replaced by the maximum and minimum allowed value, respectively

• Associativity and distributivity no longer apply

• One-signed-byte saturation arithmetic examples (range -128 to 127):

  – 64 + 69 = 127

  – -127 - 5 = -128

  – (64 + 70) - 25 = 102 ≠ 64 + (70 - 25) = 109
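
Saturating addition and subtraction on one signed byte can be sketched in C as follows. The helper names `saturate8`, `sat_add8`, and `sat_sub8` are hypothetical, chosen for this illustration; a DSP would normally provide the clamping in hardware rather than through explicit comparisons.

```c
#include <stdint.h>

/* Clamp a wider intermediate result to the signed 8-bit range. */
static int8_t saturate8(int value)
{
    if (value > INT8_MAX) return INT8_MAX;   /* would overflow: clamp to 127 */
    if (value < INT8_MIN) return INT8_MIN;   /* would underflow: clamp to -128 */
    return (int8_t)value;
}

/* Saturating add and subtract: compute in int, then clamp. */
int8_t sat_add8(int8_t a, int8_t b) { return saturate8((int)a + (int)b); }
int8_t sat_sub8(int8_t a, int8_t b) { return saturate8((int)a - (int)b); }
```

For instance, `sat_sub8(sat_add8(64, 70), 25)` evaluates to 102, whereas `sat_add8(64, sat_sub8(70, 25))` evaluates to 109, so the grouping of operations changes the result.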

Examples

• Perform the following operations using one-byte saturation arithmetic:

  – 0x77 + 0x99 =

  – 0x4 * 0x42 =

  – 0x3 * 0x51 =

Zero Overhead Looping

• Hardware support for loops with a constant number of iterations using hardware loop counters and loop buffers

• No branching

• No loop overhead

• No pipeline stalls or branch prediction

• No need for loop unrolling

Hardware Circular Addressing

• Hardware support for circular buffers: a data structure implementing a fixed-length queue of fixed-size objects, where objects are added at the head of the queue and removed from the tail

• Requires at least 2 pointers (head and tail)

• Extensively used in digital filtering

$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$

[Diagram: a four-entry circular buffer holding X[n], X[n-1], X[n-2], X[n-3], shown at Cycle 1 and Cycle 2 with its Head and Tail pointers]
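
A software version of the circular buffer used in FIR filtering might look as follows. This is a sketch under assumptions: the type `fir_t`, the function `fir_step`, the tap count `NTAPS`, and the use of `float` are hypothetical choices for illustration. The modulo operations stand in for the wrap-around that hardware circular addressing performs automatically when a pointer passes the end of the buffer.

```c
#include <stddef.h>

#define NTAPS 4   /* number of coefficients (k + 1); illustrative value */

typedef struct {
    float  x[NTAPS];   /* delay line holding the last NTAPS input samples */
    size_t head;       /* index where the next sample will be written */
} fir_t;

/* Push one input sample and return y[n] = a[0]*x[n] + ... + a[k]*x[n-k]. */
float fir_step(fir_t *f, const float a[NTAPS], float sample)
{
    f->x[f->head] = sample;               /* overwrite the oldest sample */

    float  y   = 0.0f;
    size_t idx = f->head;                 /* start at the newest sample x[n] */
    for (size_t j = 0; j < NTAPS; j++) {
        y += a[j] * f->x[idx];            /* a[j] * x[n-j] */
        idx = (idx + NTAPS - 1) % NTAPS;  /* step back one sample, wrapping */
    }

    f->head = (f->head + 1) % NTAPS;      /* advance head, wrapping around */
    return y;
}
```

Because the delay line is always full, the tail position is implied by the head, so this sketch tracks a single index. The state should be zero-initialized before the first call, for example with `fir_t f = {{0}, 0};`.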

Direct Memory Access (DMA)

• A feature that allows peripherals to access main memory without the intervention of the CPU

• Typically, the CPU initiates a DMA transfer, does other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation is complete.

• Can create cache coherency problems (the data in the cache may be different from the data in the external memory after DMA)

• Requires a DMA controller

Cache memory

• Separate instruction and data L1 caches (Harvard architecture)

• Cache coherence protocols required, since most systems use DMA

DSP vs. Microcontroller

• DSP

  – Harvard architecture
  – VLIW/SIMD (parallel execution units)
  – No bit-level operations
  – Hardware MACs
  – DSP applications

• Microcontroller

  – Mostly von Neumann architecture
  – Single execution unit
  – Flexible bit-level operations
  – No hardware MACs
  – Control applications

Examples

• Estimate how long the following code fragment will take to execute on:

  – A general-purpose processor with a 1 GHz operating frequency, five-stage pipelining, 5 cycles required for multiplication, and 1 cycle for addition

  – A DSP running at 500 MHz, with zero overhead looping, 6 independent ALUs, and 2 independent single-cycle MAC units

for (i=0; i<8; i++)
{
    a[i] = 2*i + 3;
    b[i] = 3*i + 5;
}

Review Questions

• Which of the following code fragments (left column, right column) is appropriate for SIMD implementation?

a[0]=b[0]+c[0];    a[0]=b[0]&c[0];
a[2]=b[2]+c[2];    a[0]=b[0]%c[0];
a[4]=b[4]+c[4];    a[0]=b[0]+c[0];
a[6]=b[6]+c[6];    a[0]=b[0]/c[0];

• Can the following instructions be merged into one VLIW instruction? If not, into how many?

  – a=b+c;
  – d=c/e;
  – f=d&a;
  – g=b%c;

Examples

• How many VLIW instructions does the following program fragment require if there are two independent data paths (a, b), with 3 ALUs and 1 MAC available in each, and 8 instructions per word? How many cycles will it take to execute if these are the first instructions in the program and all instructions require 1 cycle, assuming the TMS320C6x pipeline from the Pipelining slide with 6 phases of execution?

ADD a1,a2,a3    ;a3 = a1+a2
SUB b1,b3,b4    ;b4 = b1-b3
MUL a2,a3,a5    ;a5 = a2*a3
MUL b3,b4,b2    ;b2 = b3*b4
AND a7,a0,a1    ;a1 = a7 AND a0
MUL a3,a4,a5    ;a5 = a3*a4
OR  a6,a3,a2    ;a2 = a6 OR a3