Introduction to DSP processors - המחלקה להנדסת...
DSP lab 1
ד''בס
11/15/2014 1
Introduction to DSP processors
Presented by:
ד''בס
11/15/2014 2
Contents:
The modern processor’s classification;
The digital signal processing methods &
algorithms;
The D(igital) S(ignal) P(rocessing) algorithms implementation
The SHARC processor - architecture; data types & formats, C & Assembler;
Getting started.
The modern processor's classification.
Today chips fall into three groups:
ASICs (Application-Specific Integrated Circuits);
FPGA & EPLD;
Chips with hardware realization of data processing algorithms (microprocessors & microcontrollers).
ד''בס
11/15/2014 4
The modern processor’s classification.
Microprocessors & microcontrollers
DSP microprocessors.
The processors are intended for Real-Time
Digital Signal Processing systems.
General-purpose microprocessors.
This kind of processor is intended for
computer systems: PC, workstation &
parallel supercomputer.
Microcontrollers.
Highly specialized processors intended for embedded systems and various household devices.
Review: Processor Classes
General Purpose - high performance – Pentiums, Alpha's, SPARC
– 64-128 bit word size
– Used for general purpose software
– Heavy weight OS - UNIX, NT
– Multiple layers of cache memory
– Workstations, PC's
Embedded processors and processor cores – ARM9, ARC, 486SX, Hitachi
SH7000, NEC V800
– 32 bit word size
– Single program
– Lightweight, often real-time OS
– Code and Data memory cache, DSP support
– Cellular phones, consumer electronics (e. g. CD players)
DSP processors – SHARC, BlackFin, TMS320C55x,
TMS320C67x, TMS320C64x
– 16-32 bit word size
– Single program
– Lightweight, often real-time OS
– Super Harvard Architecture Support, MAC, Circular buffer, Dual-Port RAM
– Audio, Image and Video processing, Coding and Decoding, Cellular Base Station, Adaptive Filtering, Real Time operations
Microcontrollers – PIC, AVR, HC11, ARM7, 8051,80251
– Extremely cost sensitive
– Small word size - 8 bit common
– Highest volume processors by far
– Automobiles, toasters, thermostats, ...
The modern processor's classification.
[Figure: performance versus cost for the processor classes: GP high performance, GP embedded, DSP, microcontrollers]
The digital signal processing
methods & algorithms.
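The analog signal processing example:
$$\frac{y(t)}{x(t)} = -\frac{R_f}{R_i}\cdot\frac{1}{1 + j\omega R_f C_f}$$
[Figure: op-amp low-pass filter with input resistor Ri, feedback resistor Rf and feedback capacitor Cf; magnitude response versus frequency f with cutoff frequency fc]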
The digital signal processing
methods & algorithms.
The digital signal processing system:
x(t) → Anti-aliasing filter → x'(t) → A/D → x'(n) → Digital filter or digital transform → y'(n) → D/A → y'(t) → Smoothing filter → y(t)
$$y(n) = \sum_{k=0}^{N} C_k\, x(n-k)$$
[Figure: magnitude response |H(f)| with cutoff frequency fc]
The DSP vs. ASP
5th Order Analog Sallen Key Low-Pass Filter
5th Order Digital Filter of Direct Form II
The digital signal processing
methods & algorithms.
Analog signal processing versus digital signal processing
Analog signal processing:
Cheaper;
More compact;
Lower power dissipation.
Digital signal processing:
More accurate;
More stable across different environments.
DSP lab 6
ד''בס
11/15/2014 11
The digital signal processing
methods & algorithms.
The basic concepts of DSP:
Time sampling:
x(nT) – one sample at time nT;
T – sampling period (time);
f_D – sampling frequency: $f_D = \frac{1}{T}$ or $T = \frac{1}{f_D}$;
Nyquist frequency: $f_D \ge 2F$ or $T \le \frac{1}{2F}$;
F – the highest frequency of the signal.
Amplitude quantization:
x(nT) ~ x(n);
Resolution: $R = 2^N$;
e_Q – quantization error: $e_{Q\,\max} = \frac{1}{2^N}$;
N – number of bits.
ד''בס
11/15/2014 12
The digital signal processing
methods & algorithms. The basic concepts of DSP:
$$T_a < T$$
T – sample time;
T_a – algorithm time.
$$T_a = N_{op}\,\tau_{op} = \frac{N_{op}}{f_{CPU\,CLK}}, \qquad \tau_{op} = \frac{1}{f_{CPU\,CLK}}$$
N_op – number of operations in the algorithm;
f_CPU CLK – processor clk frequency;
τ_op – operation time.
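The real-time requirement on this slide (the algorithm time must be shorter than the sample time) can be checked numerically. The sketch below is only an illustration: it assumes one operation per clock cycle, and the function names and the numbers (1000 operations, 100 MHz clock, 48 kHz sampling) are hypothetical.

```python
def algorithm_time(n_ops: int, f_cpu_hz: float) -> float:
    """Algorithm time T_a = N_op / f_CPU_CLK (assumes one operation per clock)."""
    return n_ops / f_cpu_hz


def meets_real_time(n_ops: int, f_cpu_hz: float, f_sample_hz: float) -> bool:
    """True when the algorithm finishes within one sample period T = 1 / f_D."""
    sample_time = 1.0 / f_sample_hz
    return algorithm_time(n_ops, f_cpu_hz) < sample_time


if __name__ == "__main__":
    # Hypothetical numbers: 1000 operations per sample, 100 MHz clock, 48 kHz sampling.
    print(algorithm_time(1000, 100e6))
    print(meets_real_time(1000, 100e6, 48e3))
```

With these numbers the algorithm needs about 10 microseconds against a sample period of about 20.8 microseconds, so the budget is met; tripling the operation count would break it.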
DSP lab 7
ד''בס
11/15/2014 13
The D(igital) S(ignal) P(rocessing)
algorithms implementation
The major tasks in DSP:
Filter design (Linear Filtering);
Speech detection, image recognition (Spectral Analysis);
Image & speech compression (Time-Frequency Analysis);
Image & signal processing (Adaptive Filtering);
Coding, median filters (Non-Linear Processing);
Interpolation & decimation (Multirate Processing).
ד''בס
11/15/2014 14
The D(igital) S(ignal) P(rocessing)
algorithms implementation
The most commonly used DSP algorithms:
FIR filter
IIR filter
FFT
STD (standard deviation)
Convolution
Correlation
DSP lab 8
ד''בס
11/15/2014 15
The D(igital) S(ignal) P(rocessing)
algorithms implementation
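FIR – filter
(Finite Impulse Response)
$$y(n) = \sum_{i=0}^{N-1} b_i\, x(n-i)$$
[Figure: FIR filter structure: x(n) feeds a chain of Z^-1 delays; the taps are weighted by b0, b1, b2 and summed to give y(n)]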
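A direct-form FIR filter is a sum of delayed input samples weighted by the coefficients b_i. The following is a minimal sketch, assuming that samples before n = 0 are zero; the function name `fir_filter` is illustrative and not taken from the slides.

```python
def fir_filter(b, x):
    """Compute y(n) = sum_{i=0}^{N-1} b[i] * x(n - i), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, coeff in enumerate(b):
            if n - i >= 0:          # skip taps that reach before the first sample
                acc += coeff * x[n - i]
        y.append(acc)
    return y


if __name__ == "__main__":
    # The response to a unit impulse reproduces the coefficients b0, b1, b2.
    print(fir_filter([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0]))
```

On a DSP processor the inner loop maps onto one multiply-accumulate (MAC) per tap.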
ד''בס
11/15/2014 16
The D(igital) S(ignal) P(rocessing)
algorithms implementation
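IIR – filter (Infinite Impulse Response)
$$y(n) = \sum_{i=0}^{N-1} b_i\, x(n-i) - \sum_{k=1}^{M-1} a_k\, y(n-k)$$
[Figure: IIR filter structure: Z^-1 delay chain with feedforward coefficients b0, b1, b2 and feedback coefficients a0, a1]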
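An IIR filter adds a feedback sum over previous outputs to the FIR sum. The sketch below assumes the common sign convention in which the feedback terms are subtracted, and it passes the feedback coefficients as a list starting at the coefficient of y(n-1); both choices, and the name `iir_filter`, are assumptions made for illustration.

```python
def iir_filter(b, a, x):
    """Recursive filter.

    b: feedforward coefficients applied to x(n), x(n-1), ...
    a: feedback coefficients applied to y(n-1), y(n-2), ... (subtracted)
    """
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y


if __name__ == "__main__":
    # y(n) = x(n) + 0.5 * y(n-1): the impulse response never becomes exactly zero.
    print(iir_filter([1.0], [-0.5], [1.0, 0.0, 0.0, 0.0]))
```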
DSP lab 9
ד''בס
11/15/2014 17
The D(igital) S(ignal) P(rocessing)
algorithms implementation
Discrete Fourier Transform
$$X(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, e^{-i 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1$$
$$W_N^{kn} = e^{-i 2\pi k n / N}$$
ד''בס
11/15/2014 18
The D(igital) S(ignal) P(rocessing)
algorithms implementation
THE COMPLEX DFT
DFT (Time Domain → Frequency Domain):
$$X(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$
INVERSE DFT (Frequency Domain → Time Domain):
$$x(n) = \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}$$
DSP lab 10
ד''בס
11/15/2014 19
The D(igital) S(ignal) P(rocessing)
algorithms implementation
THE 8-POINT DFT:
$$X(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\, W_N^{nk}$$
$X(0) = x(0)W_8^{0} + x(1)W_8^{0} + x(2)W_8^{0} + x(3)W_8^{0} + x(4)W_8^{0} + x(5)W_8^{0} + x(6)W_8^{0} + x(7)W_8^{0}$
$X(1) = x(0)W_8^{0} + x(1)W_8^{1} + x(2)W_8^{2} + x(3)W_8^{3} + x(4)W_8^{4} + x(5)W_8^{5} + x(6)W_8^{6} + x(7)W_8^{7}$
$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{8} + x(5)W_8^{10} + x(6)W_8^{12} + x(7)W_8^{14}$
$X(3) = x(0)W_8^{0} + x(1)W_8^{3} + x(2)W_8^{6} + x(3)W_8^{9} + x(4)W_8^{12} + x(5)W_8^{15} + x(6)W_8^{18} + x(7)W_8^{21}$
$X(4) = x(0)W_8^{0} + x(1)W_8^{4} + x(2)W_8^{8} + x(3)W_8^{12} + x(4)W_8^{16} + x(5)W_8^{20} + x(6)W_8^{24} + x(7)W_8^{28}$
$X(5) = x(0)W_8^{0} + x(1)W_8^{5} + x(2)W_8^{10} + x(3)W_8^{15} + x(4)W_8^{20} + x(5)W_8^{25} + x(6)W_8^{30} + x(7)W_8^{35}$
$X(6) = x(0)W_8^{0} + x(1)W_8^{6} + x(2)W_8^{12} + x(3)W_8^{18} + x(4)W_8^{24} + x(5)W_8^{30} + x(6)W_8^{36} + x(7)W_8^{42}$
$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{14} + x(3)W_8^{21} + x(4)W_8^{28} + x(5)W_8^{35} + x(6)W_8^{42} + x(7)W_8^{49}$
ד''בס
11/15/2014 20
The D(igital) S(ignal) P(rocessing)
algorithms implementation
Direct computation of the DFT is basically inefficient because it does not exploit the symmetry and periodicity properties of the phase factor W_N. In particular, these two properties are:
Symmetry property: $W_N^{k+N/2} = -W_N^{k}$
Periodicity property: $W_N^{k+N} = W_N^{k}$
$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{5} + x(4)W_8^{4} + x(5)W_8^{3} + x(6)W_8^{2} + x(7)W_8^{1}$
$X(7) = x(0)W_8^{8} + x(1)W_8^{7} + x(2)W_8^{14} + x(3)W_8^{21} + x(4)W_8^{28} + x(5)W_8^{35} + x(6)W_8^{42} + x(7)W_8^{49}$
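The symmetry and periodicity of the phase factor can be verified numerically, together with a direct (slow) DFT. This is a sketch under the assumption that W_N = e^{-i2π/N} and that the forward transform carries the 1/N factor, as in the DFT definition used on these slides; the function names are illustrative.

```python
import cmath


def twiddle(N: int, k: int) -> complex:
    """Phase factor W_N^k = exp(-i * 2 * pi * k / N)."""
    return cmath.exp(-2j * cmath.pi * k / N)


def dft(x):
    """Direct DFT with 1/N scaling: X(k) = (1/N) * sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) / N for k in range(N)]


if __name__ == "__main__":
    N = 8
    for k in range(N):
        # Symmetry: W_N^(k + N/2) = -W_N^k.  Periodicity: W_N^(k + N) = W_N^k.
        assert abs(twiddle(N, k + N // 2) + twiddle(N, k)) < 1e-12
        assert abs(twiddle(N, k + N) - twiddle(N, k)) < 1e-12
    print([round(abs(v), 6) for v in dft([1.0] * N)])
```

The direct form costs on the order of N² complex multiplications; the FFT uses the two properties above to reduce this to the order of N·log2(N).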
DSP lab 11
ד''בס
11/15/2014 21
The D(igital) S(ignal) P(rocessing)
algorithms implementation
[Figure: 8-point FFT butterfly diagram with inputs in bit-reversed order x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7), outputs X(0) to X(7) and twiddle factors W_N^0 to W_N^7]
$X(7) = x(0)W_8^{0} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{13} + x(4)W_8^{4} + x(5)W_8^{11} + x(6)W_8^{10} + x(7)W_8^{17}$
$X(7) = x(0)W_8^{8} + x(1)W_8^{7} + x(2)W_8^{6} + x(3)W_8^{5} + x(4)W_8^{4} + x(5)W_8^{3} + x(6)W_8^{2} + x(7)W_8^{1}$
($W_8^{8} = W_8^{0}$, $W_8^{13} = W_8^{5}$, $W_8^{11} = W_8^{3}$, $W_8^{10} = W_8^{2}$, $W_8^{17} = W_8^{1}$)
Intermediate results in the diagram:
$x(0) + x(4)W_N^{4}$
$x(2) + x(6)W_N^{4}$
$x(0) + x(4)W_N^{4} + x(2)W_N^{6} + x(6)W_N^{10}$
$x(1) + x(5)W_N^{4}$
$x(3) + x(7)W_N^{4}$
$x(1) + x(5)W_N^{4} + x(3)W_N^{6} + x(7)W_N^{10}$
ד''בס
11/15/2014 22
The D(igital) S(ignal) P(rocessing)
algorithms implementation
[Figure: 8-point FFT butterfly diagram, as on the previous slide]
$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{8} + x(5)W_8^{10} + x(6)W_8^{12} + x(7)W_8^{14}$
$X(2) = x(0)W_8^{0} + x(1)W_8^{2} + x(2)W_8^{4} + x(3)W_8^{6} + x(4)W_8^{0} + x(5)W_8^{2} + x(6)W_8^{4} + x(7)W_8^{6}$
($W_8^{8} = W_8^{0}$, $W_8^{10} = W_8^{2}$, $W_8^{12} = W_8^{4}$, $W_8^{14} = W_8^{6}$)
Intermediate results in the diagram:
$x(0) + x(4)W_N^{0}$
$x(2) + x(6)W_N^{0}$
$x(0) + x(4)W_N^{0} + x(2)W_N^{4} + x(6)W_N^{4}$
$x(1) + x(5)W_N^{0}$
$x(3) + x(7)W_N^{0}$
$x(1) + x(5)W_N^{0} + x(3)W_N^{4} + x(7)W_N^{4}$
Bit reverse operations
000 – 000
100 – 001
010 – 010
110 – 011
001 – 100
101 – 101
011 – 110
111 – 111
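Bit reversal reorders the FFT input (or output) by mirroring the binary index. A minimal sketch follows; the function name `bit_reverse` is illustrative, and DSP processors typically provide this as an addressing mode rather than as code.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Mirror the lowest `bits` bits of `index` (for example 001 becomes 100)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit to the top
        index >>= 1
    return result


if __name__ == "__main__":
    for i in range(8):
        print(format(i, "03b"), "-", format(bit_reverse(i, 3), "03b"))
```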
DSP lab 12
ד''בס
The D(igital) S(ignal) P(rocessing)
algorithms implementation
11/15/2014 23
Cross correlation is a measure of similarity of two waveforms as a function of a time-lag applied to one of them
ד''בס
The D(igital) S(ignal) P(rocessing)
algorithms implementation
11/15/2014 24
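Discrete correlation
$$y(n) = \frac{1}{N}\sum_{m=0}^{M} f(m)\, g(n+m), \qquad 0 \le m \le M$$
Discrete autocorrelation
$$y(n) = \frac{1}{N}\sum_{m=0}^{M} f(m)\, f(n+m), \qquad 0 \le m \le M$$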
DSP lab 13
ד''בס
The D(igital) S(ignal) P(rocessing)
algorithms implementation
11/15/2014 25
Convolution of two square pulses: the resulting waveform is a
triangular pulse. One of the functions (in this case g) is first reflected
about τ = 0 and then offset by t, making it g(t − τ). The area under the
resulting product gives the convolution at t. The horizontal axis
is τ for f and g, and t for f*g.
Convolution is a mathematical operation on two functions f and g,
producing a third function that is typically viewed as a modified
version of one of the original functions.
ד''בס
The D(igital) S(ignal) P(rocessing)
algorithms implementation
11/15/2014 26
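Discrete convolution
$$y(n) = f(n) * g(n) = \sum_{m=-M}^{M} f(m)\, g(n-m) = \sum_{m=-M}^{M} f(n-m)\, g(m)$$
Circular discrete convolution
$$y(n) = \sum_{m=0}^{M-1}\Big(\sum_{k=-N}^{N} f(m + kM)\Big)\, g(n-m)$$
STD – standard deviation
$$S_N = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\big(x_i - \bar{x}\big)^2}$$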
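Discrete convolution and the standard deviation are both short accumulation loops. The sketch below assumes finite sequences that start at index 0 (zero outside their support) and the 1/N form of the standard deviation; the function names are illustrative.

```python
import math


def convolve(f, g):
    """Linear convolution y(n) = sum_m f(m) * g(n - m) of two finite sequences."""
    y = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(y)):
        for m in range(len(f)):
            if 0 <= n - m < len(g):
                y[n] += f[m] * g[n - m]
    return y


def std_dev(x):
    """Standard deviation with 1/N normalisation."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))


if __name__ == "__main__":
    # Two square pulses convolve into a triangular pulse.
    print(convolve([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
    print(std_dev([2, 4, 4, 4, 5, 5, 7, 9]))
```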
DSP lab 14
ד''בס
11/15/2014 27
The DSP processor ‘s architecture
Requirements for DSP processors:
1. High-speed input data, various interface devices;
2. Wide dynamic range of input data;
3. Hardware implementation of ADD, MULT & SHIFT. Parallel processing;
4. Flexible processing (possibility to “jump” from one process to another);
5. Algorithm’s regularity (Operation “come back”);
DSP processor features:
1. Various high-speed interface ports and timers;
2. Parallel access memory architecture;
3. Three mathematical units: ALU, barrel shifter and multiplier with fast MAC operation (MRB = MRB + Rx * Ry);
4. Fast handling of cycles (loops), branches & interrupts. Special addressing modes;
5. Circular buffer.
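A circular buffer keeps the most recent samples in a fixed block of memory and wraps the write index instead of moving data. The class below is a software sketch with hypothetical names; a DSP processor performs the wrap-around in its address generation hardware at no extra cycle cost.

```python
class CircularBuffer:
    """Fixed-length delay line with a wrapping write index."""

    def __init__(self, length: int):
        self.data = [0.0] * length
        self.index = 0                      # position of the next write

    def push(self, sample: float) -> None:
        self.data[self.index] = sample
        self.index = (self.index + 1) % len(self.data)   # wrap around

    def recent(self, i: int) -> float:
        """Return x(n - i); i = 0 is the newest sample."""
        return self.data[(self.index - 1 - i) % len(self.data)]


if __name__ == "__main__":
    buf = CircularBuffer(3)
    for sample in [1.0, 2.0, 3.0, 4.0, 5.0]:
        buf.push(sample)
    print(buf.recent(0), buf.recent(1), buf.recent(2))
```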
ד''בס
11/15/2014 28
Data types & formats.
Data types in DSP processors algorithms:
Integer (cycles, coefficients and arrays numbers);
Real (input & output data);
Complex (applications in frequency domain);
Logic (bitwise operation).
Data format in DSP processors :
Byte – 8 bit;
Short word – 16 bit;
Normal word – 32 bit;
Instruction word – 48 bit;
Extended normal word – 40 bit;
Long word – 64 bit.
DSP lab 15
ד''בס
Fixed-Point Design
Digital signal processing algorithms
– Often developed in floating point
– Later mapped into fixed point for digital hardware realization
Fixed-point digital hardware
– Lower area
– Lower power
– Lower per unit production cost
[Figure: design flow. Algorithm level: Idea → Floating-Point Algorithm → Range Estimation and Quantization → Fixed-Point Algorithm. Implementation level: Code Generation → Target System]
Fixed-Point Representation
Fixed point type
– Wordlength
– Integer wordlength
Quantization modes
– Round
– Truncation
Overflow modes
– Saturation
– Saturation to zero
– Wrap-around
[Figure: fixed-point word layout: a sign bit S followed by data bits X, with the wordlength and the integer wordlength marked]
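The quantization and overflow modes listed on this slide can be combined in one conversion routine. The sketch below assumes a signed word in which the sign bit is counted separately from the integer wordlength; the function name `to_fixed` and its parameters are hypothetical.

```python
import math


def to_fixed(value, wordlength, int_wordlength, rounding="round", overflow="saturate"):
    """Convert a real number to a signed fixed-point integer.

    rounding: "round" or "truncate" (truncate means round towards minus infinity)
    overflow: "saturate", "zero" (saturation to zero) or "wrap" (wrap-around)
    """
    frac_bits = wordlength - 1 - int_wordlength      # one bit is the sign
    scaled = value * (1 << frac_bits)
    q = math.floor(scaled + 0.5) if rounding == "round" else math.floor(scaled)
    lo, hi = -(1 << (wordlength - 1)), (1 << (wordlength - 1)) - 1
    if lo <= q <= hi:
        return q
    if overflow == "saturate":
        return hi if q > hi else lo
    if overflow == "zero":
        return 0
    return (q - lo) % (1 << wordlength) + lo          # two's-complement wrap


if __name__ == "__main__":
    print(to_fixed(0.5, 16, 0))                       # Q15 representation of 0.5
    print(to_fixed(1.0, 16, 0), to_fixed(1.0, 16, 0, overflow="wrap"))
```

Note how +1.0 does not fit in Q15: saturation clips it to the largest positive code, while wrap-around turns it into the most negative one.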
DSP lab 16
Overflow Handling in Fixed Point Computations
Overflow handling is an important consideration when implementing signal processing algorithms. If overflow is not controlled appropriately it can lead to problems such as detection
errors, or poor quality audio output. Typical digital signal processing CPUs include hardware support for handling overflow. Some RISC processors may include these modes as well.
(In fact I helped define and implement such modes for the 32 bit MIPS processor core used in many Broadcom products). These processors often have a “saturating” mode that sets an
instruction result to a minimum or maximum value on an overflow condition. (The term “saturating” comes from analog electronics, in which an amplifier output will be limited, or
clipped, between fixed values when a large input is applied.) Commonly the CPU will limit the result to a 32 bit twos complement integer (0x7FFFFFFF or 0x80000000). For
unsigned operations, the result would be limited to 0xFFFFFFFF. There are a number of situations in which overflow can occur, and I will discuss some of them below.
Addition and Subtraction
Overflow with twos complement integers occurs when the result of an addition or subtraction is larger than the largest integer that can be represented, or smaller than the smallest integer.
In fixed point representation, the largest or smallest value depends on the format of the number. I will assume Q31 in a 32 bit register for any examples that follow. In this case, a CPU
with saturation arithmetic would set the result to -1 or (just below) +1 on an overflow, corresponding to the integer values 0x80000000 and 0x7FFFFFFF.
Overflow in addition can only occur when the sign of the two numbers being added is the same. Overflow in subtraction can occur only when a negative number is subtracted from a
positive number, or when a positive number is subtracted from a negative number.
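The difference between saturating and wrapping addition for Q31 values in a 32 bit register can be sketched as follows. Python integers do not overflow, so the 32 bit register is emulated explicitly; the function names are illustrative.

```python
Q31_MAX = 0x7FFFFFFF       # just below +1 in Q31
Q31_MIN = -0x80000000      # exactly -1 in Q31


def sat_add_q31(a: int, b: int) -> int:
    """Add two Q31 values and clip the result to the representable range."""
    return max(Q31_MIN, min(Q31_MAX, a + b))


def wrap_add_q31(a: int, b: int) -> int:
    """Add two Q31 values the way a plain 32 bit twos complement adder would."""
    s = (a + b) & 0xFFFFFFFF
    return s - (1 << 32) if s & 0x80000000 else s


if __name__ == "__main__":
    print(hex(sat_add_q31(Q31_MAX, 1)), wrap_add_q31(Q31_MAX, 1))
```

Without saturation, adding one LSB to the largest positive value flips the result to -1.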
Negation
There is one case where negation of a number causes an overflow condition. When the smallest negative number is negated, there is no way to represent the corresponding positive
value in twos complement. For example, the value -1 in Q31 is 0x80000000. When this number is negated (flip the bits and add one) the result is again -1. If the saturation mode is set,
then the CPU will set the result to 0x7FFFFFFF (just less than +1).
Arithmetic Shift
Overflow can occur when shifting a number left by 1 to n bits. In fixed point computations, left shifting is used to multiply a fixed point value by a power of two, or to change the
format of a number (Q15 to Q31 for example). Again, many CPUs have saturation modes to set the output to the minimum or maximum 32 bit integer (depending on whether the
original number was positive or negative). Furthermore, a common feature is an instruction that counts the number of leading ones or zeros in a number. This helps the programmer
avoid overflow since the number of leading sign bits determines how large a shift can be done without causing overflow.
Overflow will not occur when right shifting a number.
Multiplication
Overflow doesn’t really occur during multiplication if the result register has enough bits (32 bits if two 16 bit numbers are multiplied). But it is partly a matter of interpretation. When
multiplying a fixed point value of -1 by -1 (0x8000 by 0x8000 using Q15 numbers), the result is +1. If the result is interpreted as a Q1.30 number (one integer bit and 30 fractional
bits) then there is no problem. If the result is to be a Q30 number (no integer bits) then an overflow condition has occurred. And if the number was to be converted to Q31 (by shifting
the result left by 1) then an overflow would occur during the left shift. The overall effect would be that -1 times -1 equals -1.
I have used a CPU that handles this special case with saturation hardware. Some CPUs have a multiplication mode that shifts the product left by one bit after a multiply operation. The
reason for doing so is to create a Q31 result when two Q15 numbers are multiplied. Then if a Q15 result is desired, it can be found by storing the upper 16 bits of the result register (if
the register is only 32 bits). The saturating mode automatically sets the result to 0x7FFFFFFF when the number 0x8000 is multiplied by itself, and the “shift left by one” multiplication
mode is enabled.
A very often used operation in DSP algorithms is the “multiply accumulate” or “MAC”, where a series of numbers is multiplied and added to a running sum. I would recommend not
using the “left shift by one” mode if possible when doing MACs, since this only increases the chance for overflow. A better technique is to keep the result as Q1.30, and then handle
overflow if converting the final result to Q31 or Q15 (or whatever). This is also a good technique to use on CPUs without saturation modes, since the number of overflow checks can
be greatly reduced in some cases.
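The special case of -1 times -1 in the “shift left by one” multiplication mode can be reproduced with a few lines. This is a sketch that emulates a 32 bit result register; the function name `mul_q15` is hypothetical.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def mul_q15(a: int, b: int, saturate: bool = True) -> int:
    """Multiply two Q15 integers and return a Q31 result (product shifted left by one)."""
    product = (a * b) << 1              # Q15 * Q15 gives Q30; one shift gives Q31
    if saturate:
        return max(INT32_MIN, min(INT32_MAX, product))
    product &= 0xFFFFFFFF               # keep 32 bits, as a wrapping register would
    return product - (1 << 32) if product & 0x80000000 else product


if __name__ == "__main__":
    minus_one = -0x8000                 # 0x8000 as a signed Q15 value
    print(hex(mul_q15(minus_one, minus_one)))            # saturated, just below +1
    print(mul_q15(minus_one, minus_one, saturate=False)) # wraps to -1 in Q31
```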
Division
Overflow in division can occur when the result would have more bits than were calculated. For example, if the magnitude of the numerator is several times larger than that of the
denominator, then the result must have enough bits to represent numbers larger than one. Overflow can be avoided by carefully considering the range of numbers being operated on,
and calculating enough bits for the result. I have not seen a CPU that implements a saturation mode for division.
ד''בס
Optimum Wordlength
Longer wordlength
– May improve application performance
– Increases hardware cost
Shorter wordlength
– May increase quantization errors and overflows
– Reduces hardware cost
Optimum wordlength
– Maximize application performance or minimize quantization error
– Minimize hardware cost
[Figure: cost c(w) and distortion d(w) [1/performance] versus wordlength (w), with the optimum wordlength marked]
DSP lab 17
ד''בס
Filter Implementation
Finite word-length effects (fixed point implementation)
• Coefficient quantization
• Overflow & quantization in arithmetic operations
  • scaling to prevent overflow
  • quantization noise statistical modeling
  • limit cycle oscillations
ד''בס
Coefficient Quantization
The coefficient quantization problem :
Filter design in Matlab (e.g.) provides filter coefficients to 15
decimal digits (such that filter meets specifications)
For implementation, need to quantize coefficients to the word
length used for the implementation.
As a result, implemented filter may fail to meet specifications… ??
PS: In present-day signal processors, this has become less of a problem
(e.g. with 16 bits (=4 decimal digits) or 24 bits (=7 decimal digits)
precision). In hardware design, with tight speed requirements, this is still
a relevant problem.
DSP lab 18
ד''בס
Coefficient Quantization
Coefficient quantization effect on pole locations :
1. tightly spaced poles (e.g. for narrow band filters) imply high sensitivity of pole locations to coefficient quantization
2. hence preference for low-order systems (parallel/cascade)
Example: Implementation of a band-pass IIR 12-order filter
Cascade structure with 16-bit coeff. Direct form with 16-bit coeff.
ד''בס
Coefficient Quantization
Coefficient quantization effect on pole locations :
example : 2nd-order system (e.g. for cascade realization)
$$H_i(z) = \frac{1 + \alpha_i z^{-1} + \beta_i z^{-2}}{1 + \gamma_i z^{-1} + \delta_i z^{-2}}$$
Coefficient Quantization
example (continued) :
with 5 bits per coefficient, all possible pole positions are...
[Figure: permissible pole positions in the z-plane, both axes from -1.5 to 1.5]
for δ_i = -1:0.0625:1
  for γ_i = -2:0.1250:2
    plot(roots([1 γ_i δ_i]))
  end
end
Low density of permissible pole locations at z=1, z=-1, hence
problem for narrow-band LP and HP filters
ד''בס
Coefficient Quantization
example (continued) :
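possible remedy: `coupled realization’
poles are α ± j.β, where α, β are realized/quantized
hence permissible pole locations are (5 bits)
[Figure: permissible pole positions in the z-plane for the coupled realization, both axes from -1.5 to 1.5, next to the coupled-form block diagram from u[k] to y[k]]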
DSP lab 20
ד''בס
Quantization of an FIR filter
Transfer function ΔH(z)
The effect of coefficient quantization to
linear phase
ד''בס
FIR filter example
Passband attenuation 0.01, Radial frequency (0,0.4)
Stopband attenuation 0.001, Radial frequency (0.4, )
DSP lab 21
ד''בס
FIR filter example – 16bits
ד''בס
FIR filter example - 8bits
DSP lab 22
ד''בס
Arithmetic Operations
Finite word-length effects in arithmetic operations:
In linear filters, have to consider additions & multiplications
Addition:
if two B-bit numbers are added, the result has (B+1) bits.
Multiplication:
if a B1-bit number is multiplied by a B2-bit number, the
result has (B1+B2-1) bits.
For instance, two B-bit numbers yield a (2B-1)-bit product
Typically (especially so in an IIR (feedback) filter), the result of an addition/multiplication has to be represented again as a B’-bit number (e.g. B’=B). Hence have to get rid of either most significant bits or least significant bits…
ד''בס
Arithmetic Operations
Option-1: Most significant bits
If the result is known to be upper bounded so that the most significant
bit(s) is(are) always redundant, it(they) can be dropped, without loss of
accuracy.
This implies we have to monitor potential overflow, and introduce scaling
strategy to avoid overflow.
Option-2 : Least significant bits
Rounding/truncation/… to B’ bits introduces quantization noise.
The effect of quantization noise is usually analyzed in a statistical manner.
Quantization, however, is a deterministic non-linear effect, which may give
rise to limit cycle oscillations.
DSP lab 23
ד''בס
Scaling
The scaling problem:
Finite word-length implementation implies maximum
representable number. Whenever a signal (output or
internal) exceeds this value, overflow occurs.
Digital overflow may lead (e.g. in 2’s-complement
arithmetic) to polarity reversal (instead of saturation
such as in analog circuits), hence may be very harmful.
Avoid overflow through proper signal scaling
Scaled transfer function may be c*H(z) instead of H(z)
(hence need proper tracing of scaling factors)
ד''בס
Scaling
Time domain scaling:
Assume input signal is bounded in magnitude
$$|u[k]| \le u_{\max}$$
(i.e. u-max is the largest number that can be represented in the `words’ reserved for the input signal)
Then output signal is bounded by
$$|y[k]| = \Big|\sum_{i=0}^{\infty} h[i]\,u[k-i]\Big| \le \sum_{i=0}^{\infty} |h[i]|\,|u[k-i]| \le u_{\max}\sum_{i=0}^{\infty} |h[i]| = u_{\max}\,\|h\|_1$$
To satisfy
$$|y[k]| \le y_{\max}$$
(i.e. y-max is the largest number that can be represented in the `words’ reserved for the output signal)
we have to scale H(z) to c.H(z), with
$$c = \frac{y_{\max}}{u_{\max}\,\|h\|_1}$$
DSP lab 24
ד''בס
Scaling
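Example:
assume u[k] comes from 12-bit A/D-converter
assume we use 16-bit arithmetic for y[k] & multiplier
$$H(z) = \frac{1}{1 - 0.99\,z^{-1}} \;\Rightarrow\; \|h\|_1 = 1 + 0.99 + 0.99^2 + \ldots = \frac{1}{1 - 0.99} = 100$$
$$c = \frac{2^{16}}{2^{12}\cdot\|h\|_1} = 0.16 \;\Rightarrow\; c = 2^{-3}$$
hence inputs u[k] have to be shifted by 3 bits to the right before entering the filter (=loss of accuracy!)
[Figure: first-order recursive filter from u[k] to y[k] with feedback multiplier 0.99, shown without and with the input shift]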
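The numbers in this example can be reproduced by summing the impulse response of the first-order filter and converting the scale factor into a right shift. The sketch assumes the impulse response h[i] = 0.99^i, truncates the infinite sum after a fixed number of terms, and rounds the scale factor down to a power of two; the function names are illustrative.

```python
import math


def l1_norm_first_order(pole: float, terms: int = 5000) -> float:
    """Approximate ||h||_1 = sum |pole|^i for H(z) = 1 / (1 - pole * z^-1)."""
    return sum(abs(pole) ** i for i in range(terms))


def scaling_shift(u_bits: int, y_bits: int, h_l1: float):
    """Return (c, shift): c = y_max / (u_max * ||h||_1), realised as 2^-shift <= c."""
    c = 2.0 ** y_bits / (2.0 ** u_bits * h_l1)
    shift = max(0, math.ceil(-math.log2(c)))
    return c, shift


if __name__ == "__main__":
    h1 = l1_norm_first_order(0.99)
    print(h1, scaling_shift(12, 16, h1))
```

The scale factor of about 0.16 is not a power of two, so the next smaller one, 2^-3, is used, which matches the 3 bit right shift stated on the slide.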
ד''בס
Scaling
L2-scaling: (`scaling in L2 sense’)
Time-domain scaling is simple & guarantees that overflow will never occur, but often over-conservative (=too small c)
$$c = \frac{y_{\max}}{u_{\max}\,\|h\|_1}$$
If an `energy upper bound’ for the input signal is known
$$E_{U\max} \ge \sum_{k=0}^{\infty} u[k]^2$$
then L2-scaling uses
$$c = \frac{y_{\max}}{\sqrt{E_{U\max}}\,\|h\|_2}$$
where
$$\|h\|_2 = \sqrt{\sum_{i=0}^{\infty} h[i]^2}$$
…is an L2-norm (this leads to larger c)
DSP lab 25
ד''בס
Scaling
So far considered scaling of H(z), i.e. transfer function
from u[k] to y[k]. In fact we also need to consider
overflow and scaling of each internal signal, i.e. scaling of
transfer function from u[k] to each and every internal
signal !
This requires quite some thinking….
(but doable)
[Figure: direct form realization with output y[k], feedforward multipliers b0 to b4, feedback multipliers -a1 to -a4 and internal signals x1[k] to x4[k]]
ד''בס
Scaling
Something that may help: If 2’s-complement arithmetic is used, and
if the sum of K numbers (K>2) is guaranteed not to overflow, then
overflows in partial sums cancel out and do not affect the final
result (similar to `modulo arithmetic’).
Example:
if x1+x2+x3+x4 is guaranteed not to
overflow, then if in (((x1+x2)+x3)+x4)
the sum (x1+x2) overflows, this overflow
can be ignored, without affecting the
final result.
As a result (1), in a direct form realization,
eventually only 2 signals have to be
considered in view of scaling :
[Figure: direct form realization with feedforward multipliers b0 to b4, feedback multipliers -a1 to -a4 and internal signals x1[k] to x4[k]]
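The modulo property of 2’s-complement addition described above can be demonstrated with a 16-bit accumulator. Python integers do not overflow, so the wrap is emulated; the word length, the helper names and the sample values are assumptions chosen for illustration.

```python
def wrap16(value: int) -> int:
    """Reduce an integer to the 16-bit two's-complement range, as hardware would."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def wrapped_sum(values):
    """Accumulate with a 16-bit wrap after every partial sum."""
    acc = 0
    for v in values:
        acc = wrap16(acc + v)
    return acc


if __name__ == "__main__":
    values = [30000, 10000, -20000, -15000]   # true sum 5000 fits in 16 bits
    print(wrap16(values[0] + values[1]))      # first partial sum overflows
    print(wrapped_sum(values))                # final result is still correct
```

The first partial sum (40000) wraps to a negative number, yet the final result equals the true sum because the total fits in the word.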
DSP lab 26
ד''בס
Scaling
As a result (2), in a transposed direct form realization, eventually
only 1 signal has to be considered in view of scaling……….:
hence preference for transposed direct form over direct form.
[Figure: transposed direct form realization from u[k] to y[k] with multipliers b0 to b4 and -a1 to -a4 and internal signals x1[k] to x4[k]]
ד''בס
Quantization Noise
The quantization noise problem :
If two B-bit numbers are added (or multiplied), the result is a B+1 (or 2B-1) bit number. Rounding/truncation/… to (again) B bits, to get rid of the least significant bit(s) introduces quantization noise.
The effect of quantization noise is usually analyzed in a statistical manner.
Quantization, however, is a deterministic non-linear effect, which may give rise to limit cycle oscillations.
PS: Will focus on multiplications only. Assume additions are implemented with sufficient number of output bits, or are properly scaled, or…
DSP lab 27
ד''בס
Quantization Noise
Quantization mechanisms:
Rounding              | Truncation               | Magnitude Truncation
mean=0                | mean=(-0.5)LSB (biased!) | mean=0
variance=(1/12)LSB^2  | variance=(1/12)LSB^2     | variance=(1/3)LSB^2
[Figure: for each mechanism, quantizer output versus input and the probability density of the error]
ד''בס
Quantization Noise
Statistical analysis based on the following assumptions :
- each quantization error is random, with uniform probability
distribution function (see previous slide)
- quantization errors at the output of a given multiplier are
uncorrelated/independent (=white noise assumption)
- quantization errors at the outputs of different multipliers are
uncorrelated/independent (=independent sources assumption)
One noise source is inserted for each multiplier.
Since the filter is linear, the output noise generated by
each noise source is added to the output signal.
DSP lab 28
ד''בס
Quantization Noise
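The effect on the output signal of noise generated at a
particular point in the filter is computed as follows:
[Figure: first-order recursive filter from u[k] to y[k] with feedback multiplier -.99 and an added noise source e[k]]
noise is e[k]. noise mean & variance are $\mu_e, \sigma_e^2$
transfer function from e[k] to filter output is G(z), g[k]
(‘noise transfer function’)
Noise mean at the output is
$$\mu_e \cdot (\text{DC gain}) = \mu_e \cdot G(z)\big|_{z=1}$$
Noise variance at the output is (remember L2-norm!)
$$\sigma_e^2 \cdot (\text{`noise gain'}) = \sigma_e^2 \cdot \Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} \big|G(e^{j\omega})\big|^2\, d\omega\Big) = \sigma_e^2 \cdot \sum_{k=0}^{\infty} g[k]^2 = \sigma_e^2 \cdot \|g\|_2^2$$
Repeat procedure for each noise source…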
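The DC gain and the noise gain can be evaluated from the impulse response of the noise transfer function. The sketch assumes that the noise enters at the adder of a first-order filter with feedback multiplier -0.99, so that G(z) = 1 / (1 + 0.99 z^-1) and g[k] = (-0.99)^k; the truncation length and the function names are also assumptions.

```python
def impulse_response(a1: float, terms: int):
    """g[k] for G(z) = 1 / (1 + a1 * z^-1), that is g[k] = (-a1)^k."""
    return [(-a1) ** k for k in range(terms)]


def dc_gain(g) -> float:
    """G(z) at z = 1, the factor applied to the noise mean."""
    return sum(g)


def noise_gain(g) -> float:
    """Squared L2-norm of g[k], the factor applied to the noise variance."""
    return sum(v * v for v in g)


if __name__ == "__main__":
    g = impulse_response(0.99, 5000)
    lsb = 1.0
    sigma_e2 = lsb ** 2 / 12          # rounding noise variance at the source
    print(dc_gain(g), noise_gain(g), sigma_e2 * noise_gain(g))
```

A pole this close to the unit circle amplifies the noise variance by a factor of about 50.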
ד''בס
Quantization Noise
In a transposed direct realization all `noise transfer
functions’ are equal (up to delay), hence all noise
sources can be lumped into one equivalent source
etc...
[Figure: transposed direct form realization from u[k] to y[k] with multipliers b0 to b4 and -a1 to -a4, internal signals x1[k] to x4[k] and one equivalent noise source e[k]]
DSP lab 29
ד''בס
Quantization Noise
In a direct realization all noise sources can be lumped into
two equivalent sources
etc...
[Figure: direct form realization from u[k] to y[k] with multipliers b0 to b4 and -a1 to -a4, internal signals x1[k] to x4[k] and two equivalent noise sources e1[k] and e2[k]]
ד''בס
Quantization Noise
PS: Quantization noise of A/D-converters can be
modeled/analyzed in a similar fashion.
Noise transfer function is filter transfer function H(z).
DSP lab 30
ד''בס
Limit Cycles
Statistical analysis is simple/convenient, but quantization
is truly a non-linear effect, and should be analyzed as a
deterministic process.
Though very difficult, such analysis may reveal odd
behavior:
Example: y[k] = -0.625.y[k-1]+u[k]
4-bit rounding arithmetic
input u[k]=0, y[0]=3/8
output y[k] = 3/8, -1/4, 1/8, -1/8, 1/8, -1/8, 1/8, -1/8,
1/8,..
Oscillations in the absence of input (u[k]=0) are called
`zero-input limit cycle oscillations’.
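The rounding example above can be simulated exactly with rational arithmetic. The sketch assumes that “4-bit” means one sign bit plus three fractional bits (LSB = 1/8) and that rounding is to the nearest LSB; the function name `simulate` is illustrative.

```python
from fractions import Fraction
import math


def simulate(coeff: Fraction, y0: Fraction, steps: int, frac_bits: int = 3):
    """Iterate y[k] = Q(coeff * y[k-1]) with zero input and rounding quantizer Q."""
    lsb = Fraction(1, 1 << frac_bits)
    y = [Fraction(y0)]
    for _ in range(steps):
        exact = coeff * y[-1]
        # Round to the nearest multiple of the LSB.
        y.append(math.floor(exact / lsb + Fraction(1, 2)) * lsb)
    return y


if __name__ == "__main__":
    outputs = simulate(Fraction(-5, 8), Fraction(3, 8), 8)   # coefficient -0.625
    print([str(v) for v in outputs])
```

The output settles into the alternating pattern 1/8, -1/8 instead of decaying to zero, although the unquantized filter is stable.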
Copyright Marc Moonen [1]
ד''בס
Limit Cycles
Example: y[k] = -0.625.y[k-1]+u[k]
4-bit truncation (instead of rounding)
input u[k]=0, y[0]=3/8
output y[k] = 3/8, -1/4, 1/8, 0, 0, 0,.. (no limit cycle!)
Example: y[k] = 0.625.y[k-1]+u[k]
4-bit rounding
input u[k]=0, y[0]=3/8
output y[k] = 3/8, 1/4, 1/8, 1/8, 1/8, 1/8,..
Example: y[k] = 0.625.y[k-1]+u[k]
4-bit truncation
input u[k]=0, y[0]=-3/8
output y[k] = -3/8, -1/4, -1/8, -1/8, -1/8, -1/8,..
Conclusion: weird, weird, weird,… ! Copyright Marc Moonen [1]
DSP lab 31
ד''בס
Limit Cycles
Limit cycle oscillations are clearly unwanted (e.g. may be
audible in speech/audio applications)
Limit cycle oscillations can only appear if the filter has
feedback. Hence FIR filters cannot have limit cycle
oscillations.
Mathematical analysis is very difficult.
Truncation often helps to avoid limit cycles (e.g. magnitude
truncation, where absolute value of quantizer output is
never larger than absolute value of quantizer input
(`passive quantizer’)).
Some filter structures can be made limit cycle free, e.g.
coupled realization, orthogonal filters (see below).
ד''בס
11/15/2014 62
The DSP processor ‘s architecture.
DSP processors with fixed and floating point.
Fixed versus Floating:
Fixed point arithmetic operations are simpler to realize in hardware;
Floating point DSP processors have more data types and commands;
Floating point advantages:
Increased accuracy;
Wide dynamic range;
Largely avoids data overflow problems;
C-compiler friendly.
Fixed point advantages:
Cheaper;
More compact.
DSP lab 32
ד''בס
11/15/2014 63
Data types & formats.
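Dynamic range:
$$DynR = \frac{\text{max value}}{\text{min value}_{\ne 0}}$$
or in [dB]:
$$DynR_{dB} = 20\log\frac{\text{max value}}{\text{min value}_{\ne 0}}$$
maximum linearity error (b – data width): $2^{-b}$
max. precision [bits]:
$$\log_2\frac{\text{max value}}{\text{max quantization error}}$$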
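The dynamic range and precision definitions translate directly into two small functions. The sketch assumes a base-10 logarithm for the dB figure and uses a 16-bit signed integer as a hypothetical example; the function names are illustrative.

```python
import math


def dynamic_range_db(max_value: float, min_nonzero_value: float) -> float:
    """Dynamic range in dB: 20 * log10(max value / smallest non-zero value)."""
    return 20.0 * math.log10(max_value / min_nonzero_value)


def precision_bits(max_value: float, max_quantization_error: float) -> float:
    """Maximum precision in bits: log2(max value / max quantization error)."""
    return math.log2(max_value / max_quantization_error)


if __name__ == "__main__":
    # 16-bit signed integer: largest value 32767, smallest non-zero magnitude 1.
    print(round(dynamic_range_db(32767, 1), 2))
    print(precision_bits(1.0, 2.0 ** -16))
```

Each additional bit of wordlength adds roughly 6 dB of dynamic range in a fixed-point format.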
ד''בס
C vs. Assembly
DSP programs are different from traditional software tasks in two important respects.
– First, the programs are usually much shorter, say, one hundred lines versus ten thousand lines.
– Second, the execution speed is often a critical part of the application.
If assembly is used at all, it is restricted to short subroutines that must run with the utmost speed.
DSP lab 33
ד''בס
C vs. Assembly
Programs in C are more flexible and quick to develop.
Programs in assembly often have better performance: they run faster and use less memory, resulting in lower cost.
11/15/2014 65
ד''בס
C vs. Assembly
Which language is best for your application?
– If you need flexibility and fast development, choose C.
– Use assembly if you need the best possible performance.
How complicated is the program?
– If it is large and intricate, use C.
– If it is small and simple, assembly may be a good choice.
Are you pushing the maximum speed of the DSP?
– If so, assembly will give you the last drop of performance from the device.
– For less demanding applications, assembly has little advantage, and you should consider using C.
How many programmers will be working together?
– If the project is large enough for more than one programmer, lean toward C and use in-line assembly only for time critical segments.
Which is more important, product cost or development cost?
– If it is product cost, choose assembly;
– If it is development cost, choose C.
What is your background?
– If you are experienced in assembly (on other microprocessors), choose assembly for your DSP.
– If your previous work is in C, choose C for your DSP.
DSP lab 34
ד''בס
11/15/2014 67
The DSP processor ‘s architecture.
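“Traditional” von Neumann architecture
[Figure: CPU connected to a single memory (data & instructions) through one address bus and one data bus]
Harvard architecture
[Figure: CPU connected to a program memory (instructions only) through the PM address bus and PM data bus, and to a data memory (data only) through the DM address bus and DM data bus]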
ד''בס
11/15/2014 68
The DSP processor ‘s architecture.
Super Harvard architecture
[Figure: CPU with an instruction cache, program memory (instruction only) on the PM address and data buses, data memory (data only) on the DM address and data buses, and an I/O controller with its own data path]
This is the SHARC DSP processor structure.
DSP lab 35
ד''בס
11/15/2014 69
The DSP processor ‘s architecture.
SHARC DSP processor structure
ד''בס
11/15/2014 70
The DSP processor ‘s architecture.
The ADSP-21160 hardware structure.
[Figure: ADSP-21160 block diagram. Core processor: program sequencer, timer, instruction cache (32 x 48-bit), DAG1 and DAG2 (8 x 4 x 32), bus connect (PX), data register file (16 x 40-bit), multiplier, barrel shifter and ALU. Dual-ported SRAM: two independent dual-ported memory blocks (block 0 and block 1) with processor port and I/O port. I/O processor: IOP registers, DMA controller, serial ports (2), link ports (6). External port: address bus mux, data bus mux, multiprocessor interface, host port interface. Buses: PM address bus (32), DM address bus (32), PM data bus (16/32/40/48/64), DM data bus (32/40/64), IOD (64), IOA (18). JTAG test & emulation]
DSP lab 36
ד''בס
11/15/2014 71
The DSP processor ‘s architecture.
ADSP-21160 Features
100 MHz, 600 MFLOPS SIMD core
1024-point complex FFT benchmark: 90 µs
4 Mbit on-chip SRAM
14 zero-overhead DMA channels
Sustained 700 Mbyte/s over the IOP bus
Two 50 Mbit/s synchronous serial ports
Six 100 Mbyte/s link ports
64-bit synchronous external port
Cluster multiprocessing support
ד''בס
11/15/2014 72
The digital signal processing methods & algorithms.
The methods for computer performance measurement
Peak (technical) performance of a microprocessor: the maximum theoretical speed of the microprocessor under ideal conditions. It is defined by the number of arithmetic operations completed per unit of time.
Real (sustained) performance of a microprocessor: the speed of the microprocessor under real conditions. It is measured by executing common benchmark programs (such as FIR, IIR or FFT).
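As a rough illustration of the two definitions, the sketch below relates them to the ADSP-21160 figures quoted on the features slide (100 MHz, 600 MFLOPS, 1024-point complex FFT in 90 µs). The function names are hypothetical, and the operation count of 5*N*log2(N) for a complex FFT is a conventional estimate assumed here, not a number taken from the slides.

```c
/* Peak rate in MFLOPS: clock in MHz times the number of floating-point
   operations that can complete per cycle (an assumed parameter). */
double peak_mflops(double clock_mhz, double flops_per_cycle)
{
    return clock_mhz * flops_per_cycle;
}

/* Integer log2 for a power-of-two FFT length. */
unsigned ilog2(unsigned n)
{
    unsigned k = 0;
    while (n > 1) { n >>= 1; ++k; }
    return k;
}

/* Sustained rate in MFLOPS from a measured FFT time in microseconds,
   counting 5*N*log2(N) operations (conventional estimate, an assumption). */
double fft_sustained_mflops(unsigned n, double time_us)
{
    double flops = 5.0 * n * ilog2(n);
    return flops / time_us;   /* operations per microsecond == MFLOPS */
}
```

Under these assumptions, 600 MFLOPS at 100 MHz corresponds to 6 operations per cycle, while the FFT benchmark works out to a somewhat lower sustained rate, which is the gap the two definitions describe.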
The DSP processor's architecture.
Pipelined instruction execution:
Instruction fetching (a);
Decoding (b);
Execution (c).
Cycle:           1  2  3  4  5
n-1 operation:   a  b  c
n operation:        a  b  c
n+1 operation:         a  b  c
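The overlap shown in the diagram can be sketched in C as follows. This is a minimal model that assumes one stage per cycle and no stalls; the function names are hypothetical.

```c
/* Total cycles to run n_ops operations through a pipeline with n_stages
   stages, assuming one stage per cycle and no stalls. */
unsigned pipeline_cycles(unsigned n_ops, unsigned n_stages)
{
    if (n_ops == 0) return 0;
    return n_stages + (n_ops - 1);
}

/* Stage letter ('a' fetch, 'b' decode, 'c' execute) occupied by operation
   `op` (0-based) in cycle `cycle` (0-based), or '-' if the operation is
   not in the three-stage pipeline during that cycle. */
char stage_in_cycle(unsigned op, unsigned cycle)
{
    static const char stages[] = { 'a', 'b', 'c' };
    if (cycle < op || cycle - op >= 3) return '-';
    return stages[cycle - op];
}
```

For the three operations in the diagram, the model gives 5 cycles in total instead of the 9 that strictly sequential execution would need.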
SHARC instruction set
SHARC programming model.
SHARC assembly language.
SHARC data operations.
SHARC flow of control.
SHARC programming model
Register files:
R0-R15 (aliased as F0-F15 for floating point)
Status registers.
Loop registers.
Data address generator registers.
SHARC assembly language
Algebraic notation terminated by semicolon:
R1=DM(M0,I0), R2=PM(M8,I8); // comment
label: R3=R1+R2;
In the first line, DM(...) is a data memory access and PM(...) is a program memory access.
Simple ALU Instructions
Fixed-point                     Floating-point
Rn = Rx + Ry                    Fn = Fx + Fy
Rn = Rx - Ry                    Fn = Fx - Fy
Rn = Rx + Ry + CI (Carry In)    Fn = ABS(Fx + Fy)
Rn = Rx - Ry + CI - 1           Fn = ABS(Fx - Fy)
Rn = (Rx + Ry)/2                Fn = (Fx + Fy)/2
COMP(Rx, Ry)                    COMP(Fx, Fy)
Rn = Rx + CI - 1                Fn = -Fx
Rn = Rx + 1                     Fn = ABS Fx
Rn = Rx - 1                     Fn = PASS Fx
Rn = -Rx                        Fn = RND Fx
Rn = ABS Rx                     Fn = SCALB Fx BY Ry
Rn = PASS Rx                    Rn = MANT Fx
Rn = Rx AND Ry                  Rn = LOGB Fx
Rn = Rx OR Ry                   Rn = FIX Fx BY Ry
Rn = NOT Rx                     Fn = FLOAT Rx BY Ry
Rn = MIN(Rx, Ry)                Rn = TRUNC Fx
Rn = MAX(Rx, Ry)                Fn = RECIPS Fx
MAC instructions - mainly INTEGER
Multiply and Accumulate
Rn = Rx * Ry              MRF = Rx * Ry
MRB = Rx * Ry             Rn = MRF + Rx * Ry
Rn = MRB + Rx * Ry        MRF = MRF + Rx * Ry
MRB = MRB + Rx * Ry       Rn = MRF - Rx * Ry
Rn = MRB - Rx * Ry        MRF = MRF - Rx * Ry
MRB = MRB - Rx * Ry       Rn = SAT MRF
Rn = SAT MRB              MRF = SAT MRF
MRB = SAT MRB             Rn = RND MRF
Rn = RND MRB              MRF = RND MRF
MRB = RND MRB             MR = Rn
Rn = MR                   Floating-point: Fn = Fx * Fy
Shifter Instructions - mainly integer
Rn = LSHIFT Rx BY Ry/<data8>
Rn = Rn OR LSHIFT Rx BY Ry/<data8>
Rn = ASHIFT Rx BY Ry/<data8>
Rn = ROT Rx BY Ry/<data8>
Rn = BCLR Rx BY Ry/<data8>
Rn = BSET Rx BY Ry/<data8>
Rn = BTGL Rx BY Ry/<data8>
BTST Rx BY Ry/<data8>
Rn = Rn OR FDEP Rx BY Ry/<bit6>:<len6> (SE)
Rn = Rx BY Ry/<bit6>:<len6> (SE)
Rn = EXP Rx (EX)
Rn = LEFTZ Rx
Rn = LEFTO Rx
Rn = FPACK Fx
Fn = FUNPACK Rx
FPACK is a cast and means (32-bit -> 16-bit) Fx.
FUNPACK is a cast and means (16-bit -> 32-bit) Rx.
Both casts also do a lot of hidden work.
Flag operations
ALU operations set: AZ (zero), AN (negative), AV (overflow), AC (fixed-point carry), AI (floating-point invalid), AF (last ALU operation was floating-point).
Multiplier operations set: MN (negative), MV (overflow), MU (floating-point underflow), MI (floating-point invalid).
Shifter operations set: SV (overflow), SZ (zero), SS (sign).
Fixed-point: -1 + 1 = 0:
AZ = 1, AN = 0, AV = 0, AC = 1, AI = 0, AF = 0.
Fixed-point: -2*3=-6:
MN = 1, MV = 0, MU = 1, MI = 0.
LSHIFT 0x7fffffff BY 3: SV=1, SZ=0, SS=0.
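The first worked example above (fixed-point -1 + 1 = 0) can be reproduced with a small C model of a 32-bit add. This is a simplified sketch of how the four integer flags are derived, not a description of the SHARC hardware; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for four of the ALU status flags named above. */
struct alu_flags { int az, an, av, ac; };

/* 32-bit fixed-point add that also derives AZ, AN, AV and AC. */
struct alu_flags alu_add(int32_t x, int32_t y, int32_t *result)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t sum = ux + uy;                 /* wraps modulo 2^32 */
    struct alu_flags f;
    f.az = (sum == 0);                      /* result is zero */
    f.an = (int)(sum >> 31);                /* sign bit of the result */
    /* signed overflow: both operands have a sign the result lacks */
    f.av = (int)(((ux ^ sum) & (uy ^ sum)) >> 31);
    f.ac = (sum < ux);                      /* carry out of bit 31 */
    *result = (int32_t)sum;
    return f;
}
```

Calling `alu_add(-1, 1, &r)` yields AZ = 1, AN = 0, AV = 0, AC = 1, matching the values listed on the slide: the result is zero, and adding 0xFFFFFFFF and 1 as unsigned numbers produces a carry without a signed overflow.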
Multifunction computations
The modified Harvard architecture allows:
– multiple data fetches in a single instruction.
The most common instructions allow:
– a memory reference and a computation to be performed at the same time.
Memory references:
– can be done two at a time in many instructions, with each reference using a DAG.
Some computations can be issued in parallel:
– dual add-subtract;
– fixed-point multiply/accumulate and add or subtract;
– floating-point multiply and ALU operation.
SHARC DSP Architecture
The machine supports both memory parallelism and operation parallelism.
This reduces the number of instructions required for common operations.
For example, the basic operation in a dot product loop can be performed in one cycle that performs two fetches, a multiplication, and an addition.
Example Multi-Function Instruction
In a single cycle the SHARC performs:
1 (2) Multiply
1 (2) Addition
1 (2) Subtraction
1 (2) Memory Read
1 (2) Memory Write
2 Address Pointer Updates
Plus the I/O Processor performs:
Active Serial Port Channels (2 Transmit, 2 Receive)
Active Link Ports (6)
Memory DMA
2 DMA Pointer Updates
f11=f1*f7, f3=f9+f14, f9=f9-f14, dm(i2,m0)=f13, f7=pm(i8,m8);
Parallelism
There are restrictions on the sources of the operands when operations are combined.
The operands going to the multiplier must come from R0 through R7 (or, in the case of floating-point operands, F0 through F7), with one input coming from R0-R3/F0-F3 and the other from R4-R7/F4-F7.
The ALU operands must come from R8-R15/F8-F15, with one operand coming from R8-R11/F8-F11 and the other from R12-R15/F12-F15.
For example, the single instruction R6 = R0 * R4, R9 = R8 + R12, R10 = R8 - R12; performs three operations.
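The operand restrictions stated above can be written as a small checker. This is only a sketch: registers are represented by their numbers (0 to 15), the function names are hypothetical, and treating the two operands as interchangeable is an assumption taken from the wording "one input ... and the other".

```c
#include <stdbool.h>

/* True if register number r lies in the inclusive range [lo, hi]. */
static bool in_range(int r, int lo, int hi)
{
    return r >= lo && r <= hi;
}

/* Multiplier operands in a multifunction instruction:
   one from R0-R3 and the other from R4-R7. */
bool mul_operands_ok(int rx, int ry)
{
    return (in_range(rx, 0, 3) && in_range(ry, 4, 7)) ||
           (in_range(ry, 0, 3) && in_range(rx, 4, 7));
}

/* ALU operands in a multifunction instruction:
   one from R8-R11 and the other from R12-R15. */
bool alu_operands_ok(int rx, int ry)
{
    return (in_range(rx, 8, 11) && in_range(ry, 12, 15)) ||
           (in_range(ry, 8, 11) && in_range(rx, 12, 15));
}
```

With this model, the pairs used in the example (R0 with R4 for the multiplier, R8 with R12 for the ALU) pass the check, while a pair such as R1 with R2 is rejected because both registers come from the same group.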
SHARC load/store
Load/store architecture: no memory-direct operations.
Two data address generators (DAGs):
– data memory;
– program memory.
Must set up DAG registers to control loads/stores.
Provide indexed, modulo, and bit-reverse addressing.
– Bit-reversal addressing can be performed only with I0 and I8, as controlled by the BR0 and BR8 bits in the MODE1 register.
BASIC addressing
Immediate value: r0 = DM(0x20000000);
Direct load: r0 = DM(_a); // Loads contents of _a
Direct store: DM(_a)= r0; // Stores R0 at _a
Base-Plus-Offset
R0 = DM(M1, I0); // Loads from location I0 + M1
//M and I are registers from the DAG register file
The DSP processor's architecture.
Circular buffer
A circular buffer is an array of n elements; when the (n + 1)th element is referenced, the reference goes to buffer location 0, wrapping around from the end to the beginning of the buffer.
The L register is set to a positive, nonzero value, the length of the circular buffer; the B register of the same number is loaded with the base address of the circular buffer.
[Figure: circular buffer with elements a0, a1, a2, ..., an-1, an; labels B1, L1 = x, I1, I1 = B1 + M.]
Circular buffer
– circular buffer with
N fixed-size slots
[Figure: circular buffer with elements a1, a2, ..., an-1, an, an+1; labels B1, L1 = x, I1, I1 = B1 + M.]
DAG registers
I0  M0  L0  B0
I1  M1  L1  B1
I2  M2  L2  B2
I3  M3  L3  B3
I4  M4  L4  B4
I5  M5  L5  B5
I6  M6  L6  B6
I7  M7  L7  B7
SHARC assembly language
The I register holds the start address.
The M register or an immediate holds the modifier value:
r0 = DM(I3,M3); // Load
DM(I2,1) = r1; // Store
Circular buffer: the I register is the buffer index, B is the buffer base address.
Two data values can be transferred to/from memory per cycle:
f0 = DM(I0,M0), f1 = PM(I9,M8);
The compiler allows the programmer to define which memory each value is stored in.
SHARC assembly language
M6 = 1;
R0 = dm(I4, M6); // post-modify
// means: R0 = dm(I4), and then I4 = I4 + M6
// However:
R0 = dm(M6, I4); // offset index only
// means: R0 = dm(M6 +I4), and still keeps I4 = I4
SHARC assembly language
Post-incrementing and Offset
B4 = 4000;
L4 = 0; // set to 0: circular buffering is disabled
I4 = 4002;
M6 = 1;
R0 = dm(M6, I4); // offset index only
R1 = dm(M6, I4); // offset index only
// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)
// with I4 = 4002 still unchanged at the end of the code
R0 = dm(I4, M6); // post-modify
R1 = dm(I4, M6); // post-modify
// means R0 = dm(4002) and R1 = dm(4003)
// with I4 = 4004 at the end of the code
SHARC assembly language
Circular buffer implementation
B4 = 4000;
L4 = 3;
I4 = 4002;
M6 = 1;
R0 = dm(M6, I4); // offset index only
R1 = dm(M6, I4); // offset index only
// means R0 = dm(4002 + 1) and R1 = dm(4002 + 1)
// with I4 = 4002 still unchanged
R0 = dm(I4, M6); // post-increment
R1 = dm(I4, M6); // post-increment
// means R0 = dm(4002); I4 then steps to 4003 = B4 + L4 and wraps to 4000 {4003 - 3},
// so R1 = dm(4000), with I4 = 4001 at the end
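The address arithmetic in the two examples above (L4 = 0 and L4 = 3) can be modelled in C. This is a minimal sketch, not the hardware definition: the struct and function names are hypothetical, and it assumes the modifier is never larger in magnitude than the buffer length.

```c
/* Minimal model of one DAG register set; fields mirror I, L and B. */
struct dag { int i, l, b; };

/* Post-modify access dm(I, M): use I as the address, then I += M,
   wrapping into [B, B + L) when L is nonzero. Returns the address used. */
int dag_post_modify(struct dag *d, int m)
{
    int addr = d->i;
    int next = d->i + m;
    if (d->l != 0) {                       /* circular buffering enabled */
        if (next >= d->b + d->l) next -= d->l;
        else if (next < d->b)    next += d->l;
    }
    d->i = next;
    return addr;
}

/* Offset access dm(M, I): the address is I + M and I is left unchanged. */
int dag_pre_modify(const struct dag *d, int m)
{
    return d->i + m;
}
```

With B = 4000, L = 3 and I = 4002, two post-modify accesses with M = 1 use addresses 4002 and 4000 and leave I = 4001; with L = 0 the same accesses use 4002 and 4003 and leave I = 4004.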
Example: C assignments
C: x = (a + b) - c;
Assembler:
r0 = DM(_a); // Load a
r1 = DM(_b); // Load b
r3 = r0+r1;
r2 = DM(_c); // Load c
r3 = r3-r2;
DM(_x) = r3; // Store result in x
Example: C assignments
C: y = a*(b+c);
Assembler: r1 = DM(_b); // Load b
r2 = DM(_c); // Load c
r2 = r1 + r2;
r0 = DM(_a); // Load a
r2 = r2*r0;
DM(_y) = r2; // Store result in y
Example: C assignments
Shorter version using pointers:
// Load b, c
r2 = DM(I1,M5), r1 = PM(I8,M13);
// Load a in parallel with the addition
r0 = r2+r1, r12 = DM(I0,M5);
r8 = r12*r0;
DM(I0,M5)= r8; // Store in y
Example: C assignments
C: z = (a << 2) | (b & 15);
Assembler: r0 = DM(_a); // Load a
r0 = LSHIFT r0 by 2; // Left shift
r1 = DM(_b), r3 = 15;// Load immediate
r1 = r1 AND r3;
r0 = r1 OR r0;
DM(_z) = r0;
SHARC jump
Unconditional flow of control change:
JUMP label;
Three addressing modes:
– Direct (specifies a 24-bit immediate address);
– Indirect (the address is supplied by the DAG2 data address generator);
– PC-relative (specifies an immediate value that is added to the current PC).
All Instructions may be executed conditionally
– if EQ r1=pm(i15,0x11);
– if LE r0 = LSHIFT r0 by 2;
Conditions come from:
– arithmetic status (ASTAT)
– mode control 1 (MODE1)
– loop register
Example: C if statement
C: if (a > b)
y = c + d;
else y = c - d;
Assembler: // if condition
r0 = DM(_a);
r1 = DM(_b);
COMP(r0,r1); // Compare
IF GT JUMP label;
// False block
r0 = DM(_c);
r1 = DM(_d);
r1 = r0 - r1;
DM(_y)= r1;
JUMP other; // Skip true block
// True block
label: r0 = DM(_c);
r1 = DM(_d);
r1 = r0 + r1;
DM(_y) = r1;
other: // Code after if
True version    Description       Complement version
EQ              ALU = 0           NE
LT              ALU < 0           GE
LE              ALU ≤ 0           GT
AC              ALU carry         NOT AC
AV              ALU overflow      NOT AV
TF              Bit test flag     NOT TF
The best if implementation
C: if (a > b)
y = c + d;
else y = c - d;
Assembler:
// Load values
r1 = DM(_a), r2 = PM(_b);
r3 = DM(_c), r4 = PM(_d);
// Compute both sum and difference
r12 = r3 + r4, r0 = r3 - r4;
// Choose which one to save
comp(r1,r2); // compare a with b
if GT r0 = r12; // a > b: keep the sum
dm(_y) = r0; // Write to y
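The idea behind this version, computing both results and then keeping one with a conditional move instead of jumping, can be sketched in C as follows. The function name is hypothetical, and whether a C compiler actually emits a conditional move for this code depends on the target and optimization settings.

```c
/* Compute both candidates, then keep one of them: the C counterpart of
   "compute both sum and difference, choose which one to save". */
int select_sum_or_diff(int a, int b, int c, int d)
{
    int sum  = c + d;        /* corresponds to r12 = r3 + r4 */
    int diff = c - d;        /* corresponds to r0  = r3 - r4 */
    int y = diff;            /* default: the else value */
    if (a > b) y = sum;      /* conditional move, no jump to a label */
    return y;
}
```

Both arithmetic results are always computed, so the work per call is constant and no control-flow change is needed, which is what makes this form attractive on a pipelined processor.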
DO UNTIL loops
The DO UNTIL instruction provides efficient looping:
LCNTR = 30, DO label UNTIL LCE;
r0 = DM(I0,M0), f2 = PM(I8,M8);
r1 = r0 - r15;
label: f4 = f2 + f3;
Here LCNTR = 30 is the loop length (16 bit), label marks the last instruction in the loop, and LCE is the termination condition.
LCE: loop counter expired
NOT LCE: loop counter not expired
The SHARC processor allows up to six nested loops.
Another version of the loop:
DO label UNTIL EQ;
R0 = R0-1;
label: comp(R0,R1);
Example: FIR filter
C: for (i=0, y=0; i<N; i++)
y = y + c[i]*x[i];
FIR filter assembler
// setup
I0 = _c; I8 = _x; // c[0] (DAG1), x[0] (DAG2)
r12 = 0; // accumulator = 0
M0 = 1; M8 = 1; // Set up increments
// Loop body
LCNTR = N, DO loopend UNTIL LCE;
// Use post-increment mode
r1 = DM(I0,M0), r2 = PM(I8,M8);
r8 = r1 * r2 (uui);
loopend: r12 = r12 + r8;
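For checking the assembler loop against a known result, a plain C reference of the same multiply-accumulate loop is useful. This is a minimal sketch; the function name `fir_dot` is hypothetical and the comments only indicate which assembler step each line corresponds to.

```c
/* Reference dot product: y = sum of c[i] * x[i] for i in [0, n). */
int fir_dot(const int *c, const int *x, int n)
{
    int acc = 0;                   /* r12 = 0 */
    for (int i = 0; i < n; i++)    /* LCNTR = N, DO loopend UNTIL LCE */
        acc += c[i] * x[i];        /* multiply, then accumulate */
    return acc;
}
```

For example, with c = {1, 2, 3, 4} and x starting with {1, 2, 3, 4} and n = 4, the result is 1 + 4 + 9 + 16 = 30.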
Example: C main + ASM function
C: int dm c[4] = {1,2,3,4};
int pm x[7] = {1,2,3,4,5,6,7};
int dm y;
extern int fir(int dm *,int pm *);
//main
void main()
{
y = fir(c,x);
}
Example: C main + ASM function
Assembler:
#include <asm_sprt.h>
.SEGMENT/PM seg_pmco;
.global _fir;
.extern _c, _x, _y;
_fir: entry;
// setup
I0=_c; I8=_x; // c[0] (DAG1), x[0] (DAG2)
// or I0 = r4, I8 = r8
r12 = 0; // accumulator = 0
M0=1; M8=1; // Set up increments
// Loop body
LCNTR = 4, DO loopend UNTIL LCE;
r1 = DM(I0,M0), r2 = PM(I8,M8);
r3 = r1 * r2 (ssi);
loopend: r12 = r12 + r3;
r0 = r12; // or dm(_y)=r12;
exit;
_fir.end:
.endseg;
Example: Using MAC operation
Assembler:
#include <asm_sprt.h>
.SEGMENT/PM seg_pmco;
.global _fir;
.extern _c, _x, _y;
_fir: entry;
//setup
I0=_c; I8=_x; // c[0] (DAG1), x[0] (DAG2)
//or I0 = r4, I8 = r8
MRF = 0; // clear the MAC accumulator
M0=1; M8=1; // Set up increments
//Loop body
LCNTR = 4, DO loopend UNTIL LCE;
r1 = DM(I0,M0), r2 = PM(I8,M8);
loopend: MRF = MRF + r1 * r2 (ssi);
r0 = MR0F;
exit;
_fir.end:
.endseg;
Example: C main + ASM function
(work with STACK)
int a,b,c,d,e,f;
extern int asm_proc( int a, int b, int c,
int d, int e );
void main()
{
a = 0xAAAAAA;
b = 0xBBBBBB;
c = 0xCCCCCC;
d = 0xDDDDDD;
e = 0xEEEEEE;
f = asm_proc(a,b,c,d,e);
}
Example: C main + ASM function
(work with STACK)
#include "asm_sprt.h"
.SEGMENT/PM seg_pmco;
.GLOBAL _asm_proc;
_asm_proc:
start:
// m7 = -1 (compiler definition)
// m6 = 1 (compiler definition)
r15 = i6;
// i6 - save C sp (stack pointer)
// i7 - asm sp (stack pointer)
i2 = r15;
modify(i2,m6);
r0 = r4; // first argument
r1 = r8; // second argument
r2 = r12; // third argument
r3 = dm(i2,m6);
// C sp + 2 (fourth argument place)
r4 = dm(i2,m6);
// C sp + 3 (fifth argument place)
r5 = 0x555555;
r0 = r0 + r5;
// r0 = return()
exit;
_asm_proc.end:
.endseg;
Special instructions to handle “C”
Cjump -- getting to “C” compatible subroutine
– Processor architecture customized for C
– Replaces 3 instructions for faster operations
– Difficult to use in ENCM515
• Will not be having assembly code calling other subroutines (95%) -- why bother, since it is slow!
RFRAME -- returning to “C” environment
– Processor architecture customized for C
– Part of MAGIC lines of code
– See reference card
“C” interface to assembly code
C/ASSEMBLY LANGUAGE INTERFACE
Special Purpose Registers – usage predetermined by compiler
I7 – C runtime stack pointer – next empty place – NOT last used
I6 – C runtime frame pointer – start of frame of current function
(cdefines.i -- I7 = CTOPstack, I6 = FP)
L6/L7 – must remain as zero – controls stack memory characteristics
DAG1 registers – M5 (0), M6 (1), M7 (-1) – in “C” runtime header
(cdefines.i -- zeroDM, plus1DM, minus1DM)
DAG2 registers – M13 (0), M14 (1), M15 (-1) – in “C” header
(cdefines.i – zeroPM, plus1PM, minus1PM)
LENGTH registers MUST RETURN to 0 – don’t touch L6/L7
Volatile registers when using “C”
R4 (INPAR1), R8 (INPAR2), R12 (INPAR3), R0 (retvalue)
Scratch or Volatile registers (cdefines.i definitions)
Don’t keep useful values in them across subroutine calls
R0, R1, R2 (cdefines.i -- retvalue, scratchR1, scratchR2)
R4, I4, M4 (cdefines.i -- INPAR1, scratchDMpt, scratchMDM)
R8 (cdefines.i -- INPAR2)
R12, I12, M12 (cdefines.i -- INPAR3, scratchPMpt, scratchMPM)
Important programming reminders
Registers for parameter passing: r4, r8, r12 (arguments) and r0 (return value);
An interrupt does not occur until 2 instructions after a delayed branch (needs 2 NOPs);
Some DAG register transfers are disallowed in an assembler routine;
It is preferable not to use the following pairs in any combination: (M7,I6), (M14,I12), (M6,I5), (M5,I6).
SOS filters creation
% Normalize the cutoff frequencies by fs/2.
Wn = [200 1000]/(fs/2);
% Calculate the coefficients of the bandpass filter
N = 5;
[b, a] = butter(N, Wn);
% Convert the zero-pole-gain filter parameters to second-order sections form
[z, p, k] = butter(N, Wn);
[sos, g] = zp2sos(z, p, k);
[Figure: the filter H(z), with input x(t) and output y(t), shown as a cascade of sections H1(z), H2(z), H3(z) with intermediate signals z1(t) and z2(t).]
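To show how such a cascade of second-order sections can be run sample by sample, here is a minimal C sketch. It assumes each section is stored in the row order returned by zp2sos, [b0 b1 b2 a0 a1 a2], uses the transposed direct form II structure as one possible choice, and applies the overall gain g at the output. The struct and function names are hypothetical.

```c
#include <stddef.h>

/* One second-order section: coefficients plus two state variables. */
struct biquad {
    double b0, b1, b2, a0, a1, a2;
    double s1, s2;   /* filter state, must start at zero */
};

/* Process one sample through one section (transposed direct form II). */
double biquad_step(struct biquad *q, double x)
{
    double y = (q->b0 * x + q->s1) / q->a0;
    q->s1 = q->b1 * x - q->a1 * y + q->s2;
    q->s2 = q->b2 * x - q->a2 * y;
    return y;
}

/* Run one sample through a cascade of n sections, then apply gain g. */
double sos_step(struct biquad *sections, size_t n, double g, double x)
{
    for (size_t k = 0; k < n; k++)
        x = biquad_step(&sections[k], x);   /* output of H_k feeds H_(k+1) */
    return g * x;
}
```

Each call to `sos_step` consumes one input sample and produces one output sample, so the intermediate signals between sections never need to be stored as whole arrays.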
Getting started
Paths to examples
C:\Program Files\Analog Devices\VisualDSP\211xx\examples
C:\Program Files\Analog Devices\VisualDSP\21k\examples
or
D:\Program Files\Analog Devices\VisualDSP\211xx\examples
D:\Program Files\Analog Devices\VisualDSP\21k\examples
Introduction to DSP processors
The END