Maurizio Palesi 1
Introduction to DSP
Maurizio Palesi
Maurizio Palesi 2
What is a DSP?
- Digital: operating by the use of discrete signals to represent data in the form of numbers
- Signal: a variable parameter by which information is conveyed through an electronic circuit
- Processing: performing operations on data according to programmed instructions
- Digital Signal Processing: changing or analysing information which is measured as discrete sequences of numbers
Maurizio Palesi 3
Main Characteristics
Compared to other embedded computing applications, DSP applications are differentiated by the following:
- Computationally demanding
- Iterative numeric algorithms
- Sensitivity to small numeric errors (audible noise)
- Stringent real-time requirements
- Streaming data
- High data bandwidth
- Predictable (though often eccentric) memory access patterns
- Predictable program flow (nested loops)
Maurizio Palesi 4
DSP Processors
In the 1970s, DSP techniques moved into telecommunication equipment. Neither existing option was adequate:
- Microprocessors: not adequate performance
- Custom fixed-function hardware: not adequate flexibility and reusability
DSP processors emerged to fill the gap.
Maurizio Palesi 5
DSP vs. General Purpose
DSPs adopt a range of specialized features:
- Single-cycle multiplier
- Multiply-accumulate operations
- Saturation arithmetic
- Separate program and data memories
- Dedicated, specialized addressing hardware
- Complex, specialized instruction sets
Today, virtually every commercial 32-bit microprocessor architecture (from ARM to 80x86) has been subject to some kind of DSP-oriented enhancement: VLIW, superscalar, SIMD, multiprocessing, ...
Maurizio Palesi 6
GP Microprocessor (circa 1984: Intel 8088)
- ~100,000 transistors
- Clock speed: ~5 MHz
- Address space: 20 bits
- Bus width: 8 bits
- 100+ instructions, 2-35 cycles per instruction
- Microcoded architecture
- Many addressing modes
- Relatively inexpensive
Apparent trends: larger address space, higher clock speed, more transistors, more instructions, more arithmetic capability, more memory management support
Maurizio Palesi 7
DSP Microprocessors (circa 1984: TMS32010)
- Clock speed: 20 MHz
- Word/bus width: 16 bits
- Address space: 8, 12 bits
- ~50,000 transistors
- ~35 instructions
- 4-cycle execution of most instructions
- Harvard architecture: separate program and data memory, buses
- 16x16 hardware multiplier
- Double-length accumulator with saturation
- A few special DSP instructions
- Relatively expensive
Apparent trends: higher clock rates, fewer cycles/instruction, somewhat expanded address spaces, more specialized DSP instructions, lower cost
Maurizio Palesi 8
RISC Processors circa 1984
- Academic research topic
- 12-16 instructions
- Single-cycle execute; no microcode!
- Small, heavily optimized instruction set executable in a single short cycle
- All instructions the same size
- No microcode = faster execution
- Extra speed more than offsets increased code size and reduced functionality
- Better compiler target
Maurizio Palesi 9
Arguments Advanced for CISC
- Fewer instructions per task
- Shorter programs
- Hardware implementation of complex instructions faster than software
- Extra addressing modes help the compiler
Maurizio Palesi 10
The RISC vs CISC Controversy
- Lots of argument; hundreds of papers
- Hottest topic in computer architecture
- In the mid-to-late '80s, many RISC uPs introduced: MIPS, SPARC (Sun), MC88000, PowerPC, i960 (Intel), PA-RISC
For a time, RISC looked tough to beat...
Maurizio Palesi 11
CISC Processors circa 1999
- Clock speed: ~400 MHz
- Several million transistors
- 32-bit address space or more
- 32-bit external buses, 128-bit internally
- ~100 instructions
- Superscalar CPU, judiciously microcoded
- On-chip cache; very complex memory hierarchy
- Single-cycle execution of most instructions!
- 32-bit floating-point ALU on board!
- Multimedia extensions
- Harvard architecture (internally)!
- Very expensive (100s of dollars)
- 10s of watts power consumption
Maurizio Palesi 12
RISC Architectures circa 1999
The same!
Maurizio Palesi 13
DSP Microprocessors circa 1999
- Clock speed: 100-200 MHz
- 16-bit (fixed point) or 32-bit (floating point) buses and word sizes
- 16-24 bit address space
- Some on-chip memory
- Single-cycle execution of most instructions
- Harvard architecture
- Lots of special DSP instructions
- 50 mW to 2 W power consumption
- Cheap!
Maurizio Palesi 14
Question
If CISC and RISC have adopted all the distinguishing features of early DSP microprocessors, and more, why didn't they take over the DSP embedded market too?
- Current high-volume DSP applications (e.g., hard disk drive controllers, cell phones) require low cost and low power
- DSP uPs are stripped of all but the most essential features for DSP applications
- The most quoted numbers for DSP uPs are not MIPS, but MIPS/$ and MIPS/mW
- Market needs in DSP embedded systems are sufficiently different that no single architectural family can compete in both the DSP and general-purpose uP markets
Maurizio Palesi 15
Time Domain Processing
Correlation:
- Autocorrelation to extract a signal from noise
- Cross correlation to locate a known signal
- Cross correlation to identify a signal
Convolution
Maurizio Palesi 16
Correlation
Correlation is a weighted moving average:

    r(n) = Σk x(k) × y(k+n)

- It requires a lot of calculation: if one signal is of length M and the other is of length N, then we need N × M multiplications to calculate the whole correlation function
- Note that really we want to multiply and then accumulate the result; this is typical of DSP operations and is called a multiply & accumulate operation
- To correlate x with y: shift y by n, multiply the two together, and integrate
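The shift-multiply-integrate procedure above can be sketched as a pair of nested multiply-accumulate loops. This is an illustrative C sketch, not code from the slides; the function name and the zero-padding convention at the edges are assumptions.

```c
#include <stddef.h>

/* Cross-correlation r[n] = sum_k x[k] * y[k+n], the multiply-accumulate
 * pattern described above. Lags where y[k+n] falls outside the buffer
 * are treated as zero. Names are illustrative. */
void correlate(const float *x, size_t M, const float *y, size_t N,
               float *r, size_t lags)
{
    for (size_t n = 0; n < lags; n++) {
        float acc = 0.0f;                 /* accumulator for this lag */
        for (size_t k = 0; k < M; k++) {
            if (k + n < N)
                acc += x[k] * y[k + n];   /* multiply & accumulate */
        }
        r[n] = acc;
    }
}
```

With x and y identical, r[0] is the largest value: the autocorrelation peaks at zero lag, which is exactly the property the following slides exploit.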
Maurizio Palesi 17
Correlation
- Correlation is a maximum when two signals are similar in shape
- Correlation is a measure of the similarity between two signals as a function of the time shift between them
- If two signals are similar and unshifted, their product is all positive
- But as the shift increases, parts of the product become negative...
- ...and so the correlation function shows at which shift the signals are similar
Maurizio Palesi 18
Detecting Periodicity
Autocorrelation as a way to detect periodicity in signals
(Figure: an EEG signal and its autocorrelation)
Maurizio Palesi 19
Detecting Periodicity
Although the rhythm is not even visible in the noisy EEG signal (upper trace), it is detected by autocorrelation (lower trace)
(Figure: an EEG signal with noise, and its autocorrelation)
Maurizio Palesi 20
Align Signals
(Figure: signal x, signal y, and corr(x,y); the correlation peak marks the shift of 6 samples between the signals)
Maurizio Palesi 21
Align Signals
(Figure: signals x and y)
Maurizio Palesi 22
Cross correlation
Cross correlation (correlating a signal with another) can be used to detect and locate a known reference signal in noise:
- A radar or sonar 'chirp' signal bounced off a target may be buried in noise...
- ...but correlating with the 'chirp' reference clearly reveals when the echo arrives
Maurizio Palesi 23
Cross Correlation to Identify a Signal
Cross correlation (correlating a signal with another) can be used to identify a signal by comparison with a library of known reference signals:
- The chirp of a nightingale correlates strongly with another nightingale...
- ...but weakly with a dove or a heron
Maurizio Palesi 24
Cross Correlation to Identify a Signal
Cross correlation is one way in which sonar can identify different types of vessel:
- Each vessel has a unique sonar signature
- The sonar system has a library of pre-recorded echoes from different vessels
- An unknown sonar echo is correlated with the library of reference echoes
- The largest correlation is the most likely match
Maurizio Palesi 25
Convolution
Convolution is a weighted moving average with one signal flipped back to front:

    r(n) = Σk x(k) × y(n−k)

- It requires a lot of calculation: if one signal is of length M and the other is of length N, then we need N × M multiplications to calculate the whole convolution function
- We need to multiply and then accumulate the result; this is typical of DSP operations and is called a multiply & accumulate operation
- To convolve one signal with another: first flip the second signal, then shift it, then multiply the two together and integrate under the curve
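The flip-shift-multiply-integrate recipe above can be sketched as nested MAC loops, with the flip realised by indexing the second signal as y[n−k]. An illustrative C sketch; the function name and output-length convention (M + N − 1 samples) are assumptions.

```c
#include <stddef.h>

/* Convolution r[n] = sum_k x[k] * y[n-k]: the second signal is flipped
 * and shifted before the multiply & accumulate. r must have room for
 * M + N - 1 values. Sketch only; names are illustrative. */
void convolve(const float *x, size_t M, const float *y, size_t N,
              float *r)
{
    for (size_t n = 0; n < M + N - 1; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < M; k++) {
            if (n >= k && n - k < N)       /* keep y[n-k] in range */
                acc += x[k] * y[n - k];    /* multiply & accumulate */
        }
        r[n] = acc;
    }
}
```

The only difference from the correlation loop is the index direction on y, which is the "flipped back to front" signal.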
Maurizio Palesi 26
Convolution vs. Correlation
- Convolution is used for digital filtering
- Convolving two signals is equivalent to multiplying the frequency spectra of the two signals together; this is easily understood, and is what we mean by filtering
- Correlation is equivalent to multiplying the complex conjugate of the frequency spectrum of one signal by the frequency spectrum of the other; this is not so easily understood, and so convolution is used for digital filtering
- Convolving by multiplying frequency spectra is called fast convolution
Maurizio Palesi 27
Fourier Transform
- The Fourier transform is an equation to calculate the frequency, amplitude and phase of each sine wave needed to make up any given signal
- The Fourier Transform (FT) is a mathematical formula using integrals
- The Discrete Fourier Transform (DFT) is a discrete numerical equivalent using sums instead of integrals
- The Fast Fourier Transform (FFT) is just a computationally fast way to calculate the DFT
- The DFT involves a summation:

    H(f) = Σk c(k) × e^(−2πjkfΔ)

- The DFT and the FFT involve a lot of multiplying and accumulating of results; this is typical of DSP operations and is called a multiply & accumulate operation
Maurizio Palesi 28
Filtering
The function of a filter is to:
- Remove unwanted parts of the signal, such as random noise
- Extract useful parts of the signal, such as components lying within a certain frequency range
Filters can be analog or digital.
(Diagram: raw signal → filter → filtered signal)
Maurizio Palesi 29
Analog Filters
- An analog filter uses analog electronic circuits, built from components such as resistors, capacitors and op amps
- Widely used in applications such as noise reduction, video signal enhancement, graphic equalisers in hi-fi systems, and many other areas
Maurizio Palesi 30
Digital Filters
A digital filter uses a digital processor (e.g. a specialised DSP chip) to perform numerical calculations on sampled values of the signal.
(Diagram: unfiltered analog signal → A/D → sampled digitised signal → DSP → digitally filtered signal → D/A → filtered analog signal)
Maurizio Palesi 31
Advantages of Digital Filters
- Programmability: the digital filter can easily be changed without affecting the circuitry
- Analog filter circuits are subject to drift and are dependent on temperature
- Digital filters can handle low frequency signals accurately
- As the speed of DSP technology continues to increase, digital filters are being applied to high frequency signals in the RF domain
- Versatility: digital filters can adapt to changes in the characteristics of the signal
Maurizio Palesi 32
DSP Processors
- Characteristic features of DSP processors: addressing modes, memory architectures
- Brief overview of the TMS320C3x family: general features, addressing modes, ISA overview, assembly programming (FIR filters, matrix-vector multiplication)
Maurizio Palesi 33
Characteristics of DSP Processors
DSP processors are mostly designed with the same few basic operations in mind, so they share the same set of basic characteristics:
- Specialised high-speed arithmetic
- Data transfer to and from the real world
- Multiple-access memory architectures
Maurizio Palesi 34
Characteristics of DSP Processors
The basic DSP operations:
- Additions and multiplications: fetch two operands, perform the addition or multiplication (usually both), store the result or hold it for a repetition
- Delays: hold a value for later use
- Array handling: fetch values from consecutive memory locations; copy data from memory to memory
(Diagram: FIR structure between input x and output y, with delays z^-1, z^-2 and coefficients c[0], c[1], c[2])
Maurizio Palesi 35
Characteristics of DSP Processors
To suit these fundamental operations, DSP processors often have:
- Parallel multiply and add
- Multiple memory accesses (to fetch two operands and store the result)
- Lots of registers to hold data temporarily
- Efficient address generation for array handling
- Special features such as delays or circular addressing
Maurizio Palesi 36
Address Generation
- The ability to generate new addresses efficiently is a characteristic feature of DSP processors
- Usually, the next needed address can be generated during the data fetch or store operation, with no overhead
- DSP processors have rich sets of address generation operations:

*rP       register indirect: read the data pointed to by the address in register rP
*rP++     postincrement: having read the data, postincrement the address pointer to point to the next value in the array
*rP--     postdecrement: having read the data, postdecrement the address pointer to point to the previous value in the array
*rP++rI   register postincrement: having read the data, postincrement the address pointer by the amount held in register rI, to point to rI values further down the array
*rP++rIr  bit reversed: having read the data, postincrement the address pointer to point to the next value in the array, as if the address bits were in bit-reversed order
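The first four modes in the table map naturally onto C pointer idioms. The helper functions below are an illustrative sketch (names are assumptions); on a DSP the pointer updates they perform cost no extra cycles because the ARAUs do them during the data access.

```c
/* C counterparts of the DSP addressing modes in the table:
 * register indirect with postincrement, postdecrement, and
 * postincrement by an index amount. */
float read_postinc(const float **p)          /* *rP++ */
{
    return *(*p)++;              /* read, then bump the pointer forward */
}

float read_postdec(const float **p)          /* *rP-- */
{
    return *(*p)--;              /* read, then bump the pointer backward */
}

float read_postinc_by(const float **p, int rI)   /* *rP++rI */
{
    float v = **p;               /* read the current element */
    *p += rI;                    /* then advance by the index amount */
    return v;
}
```

A compiler for a DSP tries to recognise exactly these patterns and fold the pointer update into the load instruction.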
Maurizio Palesi 37
Bit Reversed Addressing
- DSPs are tightly targeted at a small number of algorithms
- It is surprising that an addressing mode has been specifically defined for just one application (the FFT)

Addresses generated by a radix-2 FFT:
  0 (000) → 0 (000)
  1 (001) → 4 (100)
  2 (010) → 2 (010)
  3 (011) → 6 (110)
  4 (100) → 1 (001)
  5 (101) → 5 (101)
  6 (110) → 3 (011)
  7 (111) → 7 (111)

Without special support, such address transformations would take an extra memory access to get the new address and involve a fair amount of logical instructions.
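The "fair amount of logical instructions" looks like this when done in software. A minimal sketch (function name is an assumption) that reproduces the radix-2 table above for 3-bit addresses.

```c
/* Reverse the low 'bits' bits of addr: the address transformation a
 * radix-2 FFT needs, and what bit-reversed addressing does in hardware.
 * For bits = 3: 1 (001) -> 4 (100), 3 (011) -> 6 (110), etc. */
unsigned bit_reverse(unsigned addr, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (addr & 1u);   /* shift the LSB into the result */
        addr >>= 1;                   /* consume one input bit */
    }
    return r;
}
```

On a DSP with the *rP++rIr mode, this whole loop disappears into the address generator.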
Maurizio Palesi 38
Memory Addressing
- As DSP programmers migrate toward larger programs, they are more attracted to compilers
- Such compilers are not able to fully exploit these specific addressing modes
- The DSP community routinely uses library routines, so programmers may benefit even if they write at a high level

Addressing mode usage:
  Immediate: 30.02%
  Displacement: 10.82%
  Register indirect: 17.42%
  Direct: 11.99%
  Autoincrement, postincrement: 18.84%
  Autoincrement, preincrement with 16-bit immediate: 0.77%
  Autoincrement, preincrement with circular addressing: 0.08%
  Autoincrement, postincrement by contents of AR0: 1.54%
  Autoincrement, postincrement by contents of AR0, with circular addressing: 2.15%
  Autodecrement, postdecrement: 6.08%

The first five modes together account for roughly 90% of all accesses.
Maurizio Palesi 39
DSP Processors: Input/Output
DSP is mostly dealing with the real world:
- Communication with an overall system controller
- Signals coming in and going out
- Communication with other DSP processors
(Diagram: a DSP connected to a system controller, other DSPs, and signal in/out)
Maurizio Palesi 40
DSP Evolution
When DSP processors first came out, they were rather fast processors:
- The first floating-point DSP, the AT&T DSP32, ran at 16 MHz at a time when PC clocks were 5 MHz
- A fashionable demonstration at the time was to plug a DSP board into a PC and run a fractal (Mandelbrot) calculation on the DSP and on the PC side by side; the DSP fractal was of course faster
Today...
- The fastest DSP processor is the Texas TMS320C6201, which runs at 200 MHz
- This is no longer very fast compared with an entry-level PC, and the same fractal today will actually run faster on the PC than on the DSP!
But...
- Try feeding eight channels of high-quality audio data in and out of a Pentium simultaneously in real time, without impacting processor performance
Maurizio Palesi 41
Signals
Signals are usually handled by high-speed synchronous serial ports:
- Serial ports are inexpensive, having only two or three wires
- Well suited to audio or telecommunications data rates, up to 10 Mbit/s
- They usually operate under DMA: data presented at the port is automatically written into DSP memory without stopping the DSP
Maurizio Palesi 42
Host Communications
- Many systems have another, general-purpose, processor to supervise the DSP; for example, the DSP might be on a PC plug-in card
- Whereas signals tend to be continuous, host communication tends to require data transfer in batches, for instance to download a new program or to update filter coefficients
- Some DSP processors have dedicated host ports: the Lucent DSP32C has a host port which is effectively an 8-bit or 16-bit ISA bus; the Motorola DSP56301 and the Analog Devices ADSP21060 have host ports which implement the PCI bus
Maurizio Palesi 43
Interprocessor Communications
- Interprocessor communication is needed when a DSP application is too much for a single processor
- The Texas TMS320C40 and the Analog Devices ADSP21060 both have six link ports
- These would ideally be parallel ports at the word length of the processor, but that would use up too many pins, so a hybrid called serial/parallel is used
- On the 'C40, comm ports are 8 bits wide and it takes four transfers to move one 32-bit word
- On the 21060, link ports are 4 bits wide and it takes 8 transfers to move one 32-bit word
Maurizio Palesi 44
Memory Architectures
Additions and multiplications require us to:
- Fetch two operands
- Perform the addition or multiplication (usually both)
- Store the result or hold it for a repetition
To fetch the two operands in a single instruction cycle, we need to be able to make two memory accesses simultaneously, plus one access to write back the result and one access to fetch the instruction itself.
Maurizio Palesi 45
Memory Architectures
There are two common methods to achieve multiple memory accesses per instruction cycle:
- Harvard architecture
- Modified von Neumann architecture
Maurizio Palesi 46
Harvard Architecture
- DSP operations usually involve at least two operands, so DSP Harvard architectures usually permit the program bus to also be used for operand access
- But it is often necessary to fetch the instruction too, and the plain Harvard architecture is inadequate to support this
- Super Harvard architecture (SHARC): DSP Harvard architectures often also include a cache memory, leaving both Harvard buses free for fetching operands
(Diagram: DSP with separate program and data memories)
Maurizio Palesi 47
Modified von Neumann Architectures
- The Harvard architecture requires two memory buses, which makes it expensive to bring off the chip
- Even the simplest DSP operation requires four memory accesses (three to fetch the two operands and the instruction, plus a fourth to write the result), which exceeds the capabilities of a Harvard architecture
- Some processors get around this by using a modified von Neumann architecture
Maurizio Palesi 48
Modified von Neumann Architectures
The modified von Neumann architecture allows multiple memory accesses per instruction by running the memory clock faster than the instruction cycle:
- The Lucent DSP32C runs with an 80 MHz clock, divided by four to give 20 MIPS
- The memory clock runs at the full 80 MHz
- Each instruction cycle is divided into four 'machine states', and a memory access can be made in each machine state
(Diagram: DSP with a single memory holding program and data)
Maurizio Palesi 49
Example Processor
(Diagram recap: address generation, lots of registers, efficient I/O, parallel multiply/add, multiple memories; connections to other DSPs, a system controller, and signal in/out)
Maurizio Palesi 50
Example Processor: Lucent DSP32C
(Diagram: 24-bit address bus, 32-bit data bus, 40-bit internal precision)
- 22 x 24-bit registers, which also serve for integer arithmetic
- Modified von Neumann architecture
Maurizio Palesi 51
Example Processor: ADSP21060
(Diagram: data address 32 bits, data bus 40 bits, program address 24 bits, program data 48 bits)
- Two serial ports
- Six link ports
- Super Harvard architecture
Maurizio Palesi 52
Data Formats
DSP processors store data in fixed-point or floating-point formats. With fixed point, the programmer has to make some decisions:
- If a fixed-point number becomes too large for the available word length, it has to be scaled down, by shifting it to the right
- If a fixed-point number is small, it has to be scaled up, in order to use more of the available word length

Integer (8-bit two's complement):
  bits 0 1 0 1 0 0 1 1 with weights -2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0
  = 2^6 + 2^4 + 2^1 + 2^0 = 83

Fixed point (8 bits):
  bits 0 1 0 1 0 0 0 0 with weights -2^0 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7
  = 2^-1 + 2^-3 = 0.5 + 0.125 = 0.625
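The fixed-point example above can be checked in a few lines of C. This is an illustrative decoder for that particular 8-bit format (sign bit with weight −2^0, then fractional bits); the function name and format choice are assumptions, not part of the slides.

```c
/* Interpret an 8-bit word as the fixed-point fraction shown above:
 * MSB has weight -2^0, and bit i (counting from the MSB) has weight
 * 2^-i. All the weights are powers of two, so float holds them exactly. */
float fixed_to_float(unsigned char w)
{
    float v = (w & 0x80) ? -1.0f : 0.0f;   /* MSB carries weight -2^0 */
    for (int i = 1; i < 8; i++)
        if (w & (0x80 >> i))
            v += 1.0f / (float)(1 << i);   /* bit i adds weight 2^-i */
    return v;
}
```

The same bit pattern 0101 0000 thus reads as 80 as an integer but 0.625 as a fraction: the format is purely a matter of interpretation, which is why the scaling decisions fall on the programmer.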
Maurizio Palesi 53
Fixed Point
Fixed point can be thought of as just low-cost floating point:
- It does not include an exponent in every word
- There is no hardware that automatically aligns and normalizes operands
- The DSP programmer takes care to keep the exponent in a separate variable
- Often this variable is shared by a set of fixed-point variables: block floating point
Maurizio Palesi 54
Floating Point
Floating point format has the remarkable property of automatically scaling all numbers by moving, and keeping track of, the binary point, so that all numbers use the full word length available but never overflow.

Example (mantissa followed by exponent):
  Mantissa bits 0 1 1 0 1 0 0 0 0 with weights -2^1 2^0 2^-1 2^-2 2^-3 2^-4 2^-5 2^-6 2^-7
  = 2^0 + 2^-1 + 2^-3 = 1 + 0.5 + 0.125 = 1.625
  Exponent bits 0 1 1 0 with weights -2^3 2^2 2^1 2^0
  = 2^2 + 2^1 = 6
  Decimal value = 1.625 × 2^6
Maurizio Palesi 55
Data Formats
- In floating point, the hardware automatically scales and normalises every number
- Errors due to truncation and rounding depend on the size of the number, and can be seen as a source of quantisation noise
- This noise is then modulated by the size of the signal
- The signal-dependent modulation of the noise is undesirable because it is audible
- The audio industry therefore prefers fixed-point DSP processors over floating point
Maurizio Palesi 56
Saturating Arithmetic
- DSPs are often used in real-time applications, where an exception on arithmetic overflow is unacceptable: the system could miss an event
- To support such an environment, DSP architectures use saturating arithmetic: if the result is too large to be represented, it is set to the largest representable number
(Figure: saturating arithmetic vs. normal two's complement wrap-around arithmetic)
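Saturation is easy to emulate in C by widening the intermediate result. A minimal sketch for 16-bit operands (the function name is illustrative); a DSP does this clipping in hardware at no cost.

```c
#include <stdint.h>

/* Saturating 16-bit add: on overflow the result clips to the largest
 * (or smallest) representable value instead of wrapping around, which
 * is what normal two's complement arithmetic would do. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;   /* widen so the sum can't overflow */
    if (s > INT16_MAX) return INT16_MAX;   /* clip positive overflow */
    if (s < INT16_MIN) return INT16_MIN;   /* clip negative overflow */
    return (int16_t)s;
}
```

For an audio sample, clipping at full scale is a mild distortion; wrap-around would flip the signal to the opposite extreme, which is far more audible.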
Maurizio Palesi 57
Programming a DSP Processor
- A simple FIR filter program
- Using pointers
- Avoiding memory bottlenecks
- Assembler programming
Maurizio Palesi 58
A Simple FIR Filter
The simple FIR filter equation is

    y[n] = Σk c[k] × x[n−k]

which can be implemented quite directly in the C language:

    y[n] = 0.0;
    for (k = 0; k < N; k++)
        y[n] = y[n] + c[k] * x[n-k];

Note that y[n] is accessed repeatedly, that accessing by array index is inefficient, and that arithmetic is needed to calculate the array index n-k.
Maurizio Palesi 59
Problem in Addressing
Five operations are needed to calculate the address of the element x[n-k]:
- Load the start address of the table in memory
- Load the value of the index n
- Load the value of the index k
- Calculate the offset [n - k]
- Add the offset to the start address of the array
Only after all five operations can the compiler actually read the array element.
Maurizio Palesi 60
Using Pointers

    y[n] = 0.0;
    for (k = 0; k < N; k++)
        y[n] = y[n] + c[k] * x[n-k];

can be rewritten with pointers:

    float *y_ptr, *c_ptr, *x_ptr;
    y_ptr = &y[n];
    c_ptr = &c[0];   /* walk the coefficients forward */
    x_ptr = &x[n];   /* walk the samples backward */
    for (k = 0; k < N; k++)
        *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;

(Diagram: c_ptr, x_ptr and y_ptr pointing into the c, x and y arrays)
Maurizio Palesi 61
Using Pointers

    float *y_ptr, *c_ptr, *x_ptr;
    y_ptr = &y[n];
    c_ptr = &c[0];
    x_ptr = &x[n];
    for (k = 0; k < N; k++)
        *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;

- Each pointer still has to be initialised, but only once, before the loop
- No arithmetic is required inside the loop to calculate offsets
- Using pointers is more efficient than array indices on any processor
- It is especially efficient for DSP processors, where address increments often come for free
Maurizio Palesi 62
Using Pointers
Recall the DSP address generation operations:

*rP      register indirect: read the data pointed to by the address in register rP
*rP++    postincrement: having read the data, postincrement the address pointer to point to the next value in the array
*rP--    postdecrement: having read the data, postdecrement the address pointer to point to the previous value in the array
*rP++rI  register postincrement: having read the data, postincrement the address pointer by the amount held in register rI, to point to rI values further down the array

- The address increments are performed in the same instruction as the data access to which they refer, so they incur no overhead at all
- Most DSP processors can perform two or three address increments for free in each instruction
- So the use of pointers is crucially important for DSP processors
Maurizio Palesi 63
Limiting Memory Accesses

    float *y_ptr, *c_ptr, *x_ptr;
    y_ptr = &y[n];
    for (k = 0; k < N; k++)
        *y_ptr = *y_ptr + *c_ptr++ * *x_ptr--;   /* one store, three loads */

- Each iteration makes four memory accesses: one store and three loads
- Even without counting the need to load the instruction, this exceeds the capacity of a DSP processor
- Fortunately, DSP processors have lots of registers
Maurizio Palesi 64
Limiting Memory Accesses
Accumulating in a register removes two memory accesses per iteration:

    register float temp;
    temp = 0.0;
    for (k = 0; k < N; k++)
        temp = temp + *c_ptr++ * *x_ptr--;

The initialization of temp to zero is wasted; it can be replaced by the first product:

    register float temp;
    temp = *c_ptr++ * *x_ptr--;
    for (k = 1; k < N; k++)
        temp = temp + *c_ptr++ * *x_ptr--;
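Putting the pieces together (pointers, a register accumulator seeded with the first product, a single store at the end) gives a complete tap loop. This is an illustrative sketch, not code from the slides; the function name is an assumption, and x_n must point at the newest sample of a buffer holding at least N samples before it.

```c
#include <stddef.h>

/* One FIR output sample: y = sum_k c[k] * x[n-k].
 * Coefficients are walked forward, samples backward; the accumulator
 * lives in a register and is seeded with the first product, so the
 * loop body is a pure multiply & accumulate with free pointer updates. */
float fir_sample(const float *c, const float *x_n, size_t N)
{
    const float *c_ptr = c;        /* coefficients, walked forward   */
    const float *x_ptr = x_n;      /* newest sample, walked backward */
    register float temp = *c_ptr++ * *x_ptr--;   /* no wasted init   */
    for (size_t k = 1; k < N; k++)
        temp += *c_ptr++ * *x_ptr--;             /* multiply & accumulate */
    return temp;                   /* one store, outside the loop    */
}
```

The loop body now needs only two loads per iteration, which fits the dual data buses of a Harvard-style DSP.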
Maurizio Palesi 65
Compilers for DSPs
Despite the well-documented advantages in programmer productivity and software maintenance, compiled code still trails hand-written assembly.

TMS320C54 (C54) on DSPstone kernels; ratios to assembly (execution time, code space; >1 means slower/bigger):
  Convolution: 11.8, 16.5
  FIR: 11.5, 8.7
  Matrix 1x3: 7.7, 8.1
  FIR2dim: 5.3, 6.5
  Dot product: 5.2, 14.1
  LMS: 5.1, 0.7
  N real update: 4.7, 14.1
  IIR n biquad: 2.4, 8.6
  N complex update: 2.4, 9.8
  Matrix: 1.2, 5.1
  Complex update: 1.2, 8.7
  IIR one biquad: 1.0, 6.4
  Real update: 0.8, 15.6
  C54 geometric mean: 3.2, 7.8

TMS320C6203 (C62) on EEMBC Telecom kernels; ratios to assembly (execution time, code space):
  Convolution encoder: 44.0, 0.5
  Fixed-point complex FFT: 13.5, 1.0
  Viterbi GSM decoder: 13.0, 0.7
  Fixed-point bit allocation: 7.0, 1.4
  Autocorrelation: 1.8, 0.7
  C62 geometric mean: 10.0, 0.8
Maurizio Palesi 66
Introduction to the TMS320C3x
The TMS320C3x generation of DSPs are high-performance 32-bit floating-point devices in the TMS320 family:
- Extensive internal busing
- Powerful DSP instruction set
- 60 MFLOPS
- High degree of on-chip parallelism: up to 11 operations in a single instruction
Maurizio Palesi 67
General Features
- General-purpose register file
- Program cache
- Dedicated auxiliary register arithmetic units (ARAUs)
- Internal dual-access memories
- Direct memory access (DMA)
- Short machine-cycle time
Maurizio Palesi 68
Block Diagram
Maurizio Palesi 69
C3x Family
- TMS320C30: 4K ROM, 2K RAM, second serial port, second external bus
- TMS320C31: low-cost version, boot loader program, 2K RAM, single serial port, single external bus
- TMS320C32: enhanced version of the 'C3x family, variable-width memory interface, two-channel DMA coprocessor with configurable priorities, relocatable interrupt vector table
- TMS320C33: low power, boot loader program, 34K RAM, single serial port, single external bus
Maurizio Palesi 70
Typical Applications (1/2)
Maurizio Palesi 71
Typical Applications (2/2)
Maurizio Palesi 72
Central Processing Unit (CPU)
The 'C3x devices have a register-based CPU architecture. The CPU consists of the following components:
- Floating-point/integer multiplier
- Arithmetic logic unit (ALU)
- 32-bit barrel shifter
- Internal buses (CPU1/CPU2 and REG1/REG2)
- Auxiliary register arithmetic units (ARAUs)
- CPU register file
Maurizio Palesi 73
Block diagram of the CPU
Maurizio Palesi 74
Multiplier: single-cycle multiplications (33 ns at 30 MHz)
- 24-bit integer operands, 32-bit result
- 32-bit floating-point operands, 40-bit result
Maurizio Palesi 75
ALU and barrel shifter:
- The ALU performs single-cycle operations on 32-bit integer, 32-bit logical, and 40-bit floating-point data
- Single-cycle integer and floating-point conversions
- The barrel shifter can shift up to 32 bits left or right in a single cycle
Maurizio Palesi 76
Internal buses:
- Four internal buses (CPU1, CPU2, REG1, and REG2) carry two operands from memory and two operands from the register file
- This allows parallel multiplies and adds/subtracts on four integer or floating-point operands in a single cycle
Maurizio Palesi 77
ARAUs:
- Two auxiliary register arithmetic units (ARAU0 and ARAU1) can generate two addresses in a single cycle
- The ARAUs operate in parallel with the multiplier and ALU
- They support addressing with displacements, index registers (IR0 and IR1), circular addressing, and bit-reversed addressing
Maurizio Palesi 78
Register file:
- 28 registers in a multiport register file
- All of the primary registers can be operated upon by the multiplier and ALU, and used as general-purpose registers
- The registers also have some special functions:
  - The eight extended-precision registers are especially suited for maintaining extended-precision floating-point results
  - The eight auxiliary registers support a variety of indirect addressing modes and can be used as general-purpose 32-bit integer and logical registers
  - The remaining registers provide system functions such as addressing, stack management, processor status, interrupts, and block repeat
Maurizio Palesi 79
RAM, ROM, and Cache
- Each RAM and ROM block is capable of supporting two CPU accesses in a single cycle
- E.g., in a single cycle the CPU can access two data values in one RAM block, perform an external program fetch, and perform a DMA transfer loading the other RAM block
Maurizio Palesi 80
Peripherals
- Timers: the two timer modules are general-purpose 32-bit timer/event counters
- Serial ports: the bidirectional serial ports are totally independent; each can be configured to transfer 8, 16, 24, or 32 bits of data per word
Maurizio Palesi 81
Direct Memory Access (DMA)
- The DMA controller can read/write any location in the memory map without interfering with CPU operation
- Dedicated DMA address and data buses minimize conflicts between the CPU and the DMA controller
- When the CPU and DMA access the same resources, priorities must be established: CPU priority, DMA priority, or rotating priority
Maurizio Palesi 82
Extended Precision Registers (R7-R0)
Can store and support operations on 32-bit integer and 40-bit floating-point numbers
Maurizio Palesi 83
Auxiliary Registers (AR7-AR0)
The CPU can access:
- Eight 32-bit auxiliary registers (AR7-AR0)
- Two auxiliary register arithmetic units (ARAUs)
The primary function of the auxiliary registers is the generation of 24-bit addresses. They can also operate as loop counters in indirect addressing, or as 32-bit general-purpose registers that can be modified by the multiplier and ALU.
Maurizio Palesi 84
Other Registers
- Data-page pointer (DP): the eight LSBs are used by the direct addressing mode as a pointer to the page of data being addressed; data pages are 64K words long, with a total of 256 pages
- Index registers (IR1, IR0): used by the ARAU for indexing the address
- Block size register (BK): used by the ARAU in circular addressing to specify the data block size
- System-stack pointer (SP): contains the address of the top of the system stack; SP always points to the last element pushed onto the stack, and is manipulated by interrupts, traps, calls, returns, and the PUSH, PUSHF, POP, and POPF instructions
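The circular addressing that BK enables can be sketched in software as a modulo index update. An illustrative sketch (function and parameter names are assumptions); the ARAU performs this wrap with no overhead during the data access itself.

```c
#include <stddef.h>

/* Software model of circular addressing: advancing an index within a
 * block of bk words wraps back to the start of the block, which is
 * what the ARAU's circular mode does using the BK register. */
size_t circ_advance(size_t idx, size_t step, size_t bk)
{
    return (idx + step) % bk;   /* wrap inside the bk-word block */
}
```

This is exactly the access pattern of the FIR delay line: instead of shifting every sample, the newest sample overwrites the oldest and the read pointer wraps around the buffer.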
Maurizio Palesi 85
Status Register (ST)
- Contains global information about the state of the CPU
- Operations usually set the condition flags of the status register according to whether the result is 0, negative, etc.
- Flags and control bits: global interrupt enable, clear cache, cache enable, cache freeze, repeat mode, overflow mode, latched floating-point overflow, latched overflow, floating-point underflow, negative, zero, overflow, carry
Maurizio Palesi 86
Repeat Counter (RC) and Block Repeat (RS, RE)
- RC is a 32-bit register that specifies the number of times a block of code is to be repeated when a block repeat is performed; if RC = n, the loop is executed n+1 times
- The RS register contains the starting address of the program-memory block to be repeated when the CPU is operating in repeat mode
- The RE register contains the ending address of the program-memory block to be repeated when the CPU is operating in repeat mode
Maurizio Palesi 87
Instruction Cache (1/2)
64×32-bit instruction cache
2-way set associative, with an LRU replacement policy
It allows the use of slow external memories while still achieving single-cycle access performance
The cache also frees the external buses from program fetches, so that they can be used by the DMA or other system elements
Maurizio Palesi 88
Instruction Cache (2/2)
Maurizio Palesi 89
Addressing Modes
Five types of addressing
  Register addressing
  Direct addressing
  Indirect addressing
  Immediate addressing
  PC-relative addressing
Plus two specialized addressing modes
  Circular addressing
  Bit-reversed addressing
Maurizio Palesi 90
Register Addressing
A CPU register contains the operand
Every CPU register can be used (R0-R7, AR0-AR7, DP, IR0, IR1, …)
ABSF R1 ; R1 = |R1|
Maurizio Palesi 91
Direct Addressing
The data address is formed by the concatenation of the eight LSBs of the data-page pointer (DP) with the 16 LSBs of the instruction word (expr)
This results in 256 pages of 64K words each
Maurizio Palesi 92
Direct Addressing
ADDI @0BCDEh, R7   ; DP = 8Ah, so the effective address is 8ABCDEh
Before instruction: DP = 8Ah, R7 = 00 0000 0000h, data memory at 8ABCDEh = 1234 5678h
After instruction:  DP = 8Ah, R7 = 00 1234 5678h, data memory at 8ABCDEh = 1234 5678h
Maurizio Palesi 93
Indirect Addressing
Specifies the address of an operand in memory through the contents of an auxiliary register, plus optional displacements or index registers
The auxiliary register arithmetic units (ARAUs) perform the unsigned arithmetic
Maurizio Palesi 94
Indirect Addressing
Indirect addressing with displacement
Maurizio Palesi 95
Indirect Addressing
Indirect addressing with index register IR0
Maurizio Palesi 96
Indirect Addressing
Indirect addressing (special cases)
Maurizio Palesi 97
Indirect Addressing - Example
Indirect addressing with predisplacement add
*+ARn(disp)
Maurizio Palesi 98
Indirect Addressing - Example
Indirect addressing with predisplacement add and modify
*++ARn(disp)
Maurizio Palesi 99
Immediate Addressing
The operand is a 16-bit (short) or 24-bit (long) immediate value contained in the instruction word
Depending on the data types assumed for the instruction, the immediate operand can be a 2s-complement integer, an unsigned integer, or a floating-point number
SUB 1, R0
Before instruction: R0 = 00 0000 0000h
After instruction:  R0 = 00 FFFF FFFFh
Maurizio Palesi 100
PC-relative Addressing
It adds the contents of the 16 or 24 LSBs of the instruction word to the PC register
The assembler takes the src (a label or address) specified by the user and generates a displacement equal to [src − (instruction address + 1)]
BU Label   ; PC = 1001h, Label = 1005h --> displacement = 3
Before instruction (decode phase):   PC = 1002h
After instruction (execution phase): PC = 1005h
Maurizio Palesi 101
Circular Addressing
Many DSP algorithms, such as convolution and correlation, require a circular buffer in memory
In convolution and correlation, the circular buffer acts as a sliding window that contains the most recent data to process
As new data is brought in, it overwrites the oldest data
[Figure: logical (ring) vs. physical (linear, from Start to End) representation of a circular buffer]
Maurizio Palesi 102
Circular Addressing
[Figure: after three writes, value0-value2 occupy the first locations of the buffer in both the logical and the physical view]
Maurizio Palesi 103
Circular Addressing
[Figure: after further writes (value0-value7), new samples wrap around to the start of the physical buffer, overwriting the oldest entries]
Maurizio Palesi 104
Implementation
BK holds the length of the circular buffer (16 bits, <64K)
The K LSBs of the start address of the buffer must be 0, where K is the smallest integer such that 2^K > buffer length

Length of buffer   BK register value   Starting address of buffer
31                 31                  XXXXXXXXXXXXXXXXXXX00000
32                 32                  XXXXXXXXXXXXXXXXXX000000
1024               1024                XXXXXXXXXXXXX00000000000
Maurizio Palesi 105
Algorithm for Circular Addressing
[Figure: circular buffer from Start to End, with the buffer length (BK) and the current index]
if (0 ≤ index+step < BK)
    index = index + step;
else if (index+step ≥ BK)
    index = index + step - BK;
else
    index = index + step + BK;
Maurizio Palesi 106
Circular Addressing - Example
*ARn++(disp)%   ; addr = ARn; ARn = circ(ARn+disp)
[Figure: memory locations at addresses 0-8; the circular buffer spans addresses 0-5 (BK = 6)]
*AR0%        ; AR0 is 0; BK is 6
*AR0++(5)%   ; Now AR0 is circ(0+5) = 5
*AR0++(2)%   ; Now AR0 is circ(5+2) = 1
*AR0--(3)%   ; Now AR0 is circ(1-3) = 4
*AR0++(6)%   ; Now AR0 is circ(4+6) = 4
*AR0--%      ; Now AR0 is circ(4-1) = 3
Maurizio Palesi 107
ISA Overview
The instruction set contains 113 instructions
  Load and store
  2-operand arithmetic/logical
  3-operand arithmetic/logical
  Program control
  Interlocked operations
  Parallel operations
Maurizio Palesi 108
Load & Store
The ’C3x supports 13 load and store instructions, which can
  Load a word from memory into a register
  Store a word from a register into memory
  Manipulate data on the system stack
Maurizio Palesi 109
2-Operand Instructions
The ’C3x supports 35 2-operand arithmetic and logical instructions
The two operands are the source and destination
The source operand can be a memory word, a register, or part of the instruction word
The destination operand is always a register
Maurizio Palesi 110
2-Operand Instructions
Maurizio Palesi 111
3-Operand Instructions
3-operand instructions have two source operands and a destination operand
A source operand can be a memory word or a register
The destination is always a register
Maurizio Palesi 112
Program-Control Instructions
The program-control instruction group consists of all instructions that affect program flow
Maurizio Palesi 113
Low-Power Control Instructions
The low-power control instruction group consists of three instructions that affect the low-power modes
Maurizio Palesi 114
Interlocked-Operations Instructions
The five interlocked-operations instructions support multiprocessor communication and use external signals to provide powerful synchronization mechanisms
They also ensure the integrity of the communication and result in high-speed operation
Maurizio Palesi 115
Parallel Operations
The 13 parallel-operations instructions make a high degree of parallelism possible
Some of the ’C3x instructions can occur in pairs that are executed in parallel
  Parallel loading of registers
  Parallel arithmetic operations
  Arithmetic/logical instructions used in parallel with a store instruction
Maurizio Palesi 116
Parallel Operations
Parallel arithmetic with store instructions, and many others
Maurizio Palesi 117
Parallel Operations
Parallel load instructions
Parallel multiply and add/subtract instructions
Maurizio Palesi 118
Repeat Modes
The repeat modes can implement zero-overhead looping
For many algorithms, most execution time is spent in an inner kernel of code
Two instructions
  RPTB repeats a block of code
    Repeats execution of a block of code a specified number of times
  RPTS repeats a single instruction
    Fetches a single instruction once and then repeats its execution a number of times
    Since the instruction is fetched only once, bus traffic is minimized
Maurizio Palesi 119
Repeat Mode Registers
RS  Repeat start-address register
    Holds the address of the first instruction of the block of code to be repeated
RE  Repeat end-address register
    Holds the address of the last instruction of the block of code to be repeated (RE ≥ RS)
RC  Repeat-counter register
    Contains 1 less than the number of times the block remains to be repeated
    For example, to execute a block n times, load n−1 into RC
Maurizio Palesi 120
Branches
Standard branches
Delayed branches
Conditional delayed branches
Maurizio Palesi 121
Standard Branches
Empty the pipeline before performing the branch, resulting in a ’C3x branch taking four cycles
Included in this class are repeats, calls, returns, and traps
Maurizio Palesi 122
Delayed Branches
Do not empty the pipeline
Execute the next three instructions before the program counter is modified by the branch
This results in a branch that requires only a single cycle
Maurizio Palesi 123
Conditional Delayed Branches
Use the conditions that exist at the end of the instruction immediately preceding the delayed branch
They do not depend on the instructions following the delayed branch
Maurizio Palesi 124
Calls, Traps, and Returns
Calls and traps provide a means of executing a subroutine or function while providing a return to the calling routine
Call and trap instructions store the value of the PC on the stack before changing the PC’s contents
Return instructions use the value on the stack to return execution from traps and calls
Functionally, calls and traps accomplish the same task: a subfunction is called and executed, and control is then returned to the calling function
In traps, interrupts are automatically disabled when the trap is executed; this allows critical code to execute without risk of being interrupted
Traps are generally terminated with a RETI instruction to re-enable interrupts
Maurizio Palesi 125
Examples
FIR filter
Matrix-vector multiplication
Maurizio Palesi 126
Data Structure for FIR Filters
Circular addressing is especially useful for the implementation of FIR filters
[Figure: two circular buffers: the impulse response h(0)…h(N-1), pointed to by AR0, and the input samples x(0)…x(N-1), pointed to by AR1]
Maurizio Palesi 127
FIR Filter Code
* Impulse response
        .sect   "Impulse_Resp"
H       .float  1.0
        .float  0.99
        .float  0.95
        ...
        .float  0.1
* Input buffer
X       .usect  "Input_Buf",128
        .data
HADDR   .word   H
XADDR   .word   X
N       .word   128
[Figure: memory map: the Impulse_Resp section holds 1.0, 0.99, ..., 0.1 starting at H; the Input_Buf section reserves 128 uninitialized words at X; HADDR, XADDR, and N live in the .data section]
Maurizio Palesi 128
FIR Filter Code (cnt’d)
* Initialization
        LDP   HADDR
        LDI   @N,BK         ; Load block size
        LDI   @HADDR,AR0    ; Load pointer to impulse response
        LDI   @XADDR,AR1    ; Load pointer to input samples
TOP     LDF   IN,R3         ; Read input sample
        STF   R3,*AR1++%    ; Store the sample
        LDF   0,R0          ; Initialize R0
        LDF   0,R2          ; Initialize R2
* Filter
        RPTS  N−1           ; Repeat next instruction
        MPYF3 *AR0++%,*AR1++%,R0
||      ADDF3 R0,R2,R2      ; MAC
        ADDF  R0,R2         ; Last product accumulated
        STF   R2,Y          ; Save result
        B     TOP           ; Repeat
Maurizio Palesi 129
Matrix-Vector Multiplication
[P]K×1 = [M]K×N × [V]N×1

for (i = 0; i < K; i++) {
    p[i] = 0;
    for (j = 0; j < N; j++)
        p[i] = p[i] + m[i][j] * v[j];
}
Maurizio Palesi 130
Matrix-Vector Multiplication
Data memory organization
Maurizio Palesi 131
Matrix-Vector Multiplication
* AR0 : ADDRESS OF M(0,0)
* AR1 : ADDRESS OF V(0)
* AR2 : ADDRESS OF P(0)
* AR3 : NUMBER OF ROWS - 1 (K-1)
* R1  : NUMBER OF COLUMNS - 2 (N-2)
MAT     LDI   R1,IR0                  ; Number of columns-2 -> IR0
        ADDI  2,IR0                   ; Number of columns -> IR0
ROWS    LDF   0.0,R2                  ; Initialize R2
        MPYF3 *AR0++(1),*AR1++(1),R0  ; m(i,0) * v(0) -> R0
        RPTS  R1                      ; Multiply a row by a column
        MPYF3 *AR0++(1),*AR1++(1),R0  ; m(i,j) * v(j) -> R0
||      ADDF3 R0,R2,R2                ; m(i,j-1) * v(j-1) + R2 -> R2
        SUBI  1,AR3                   ; Counts the no. of rows left
        BNZD  ROWS                    ; Delayed branch if rows remain
        ADDF  R0,R2                   ; Last accumulate      (delay slot)
        STF   R2,*AR2++(1)            ; Result -> p(i)       (delay slot)
        NOP   *--AR1(IR0)             ; AR1 back to v(0)     (delay slot)
Maurizio Palesi 132
C Programming Tips
After writing your application in C, debug the program and determine whether it runs efficiently
If the program does not run efficiently
  Use the optimizer with the –o2 or –o3 options when compiling
  Use registers to pass parameters (–ms compiling option)
  Use inlining (–x compiling option)
  Remove the –g option when compiling
  Follow some of the efficient code-generation tips
    Use register variables for often-used variables
    Precompute subexpressions
    Use *++ to step through arrays
    Use structure assignments to copy blocks of data
Maurizio Palesi 133
Use Register Variables
Exchange one object in memory with another

register float *src, *dest, temp;
do {
    temp  = *++src;
    *src  = *++dest;
    *dest = temp;
} while (--n);
Maurizio Palesi 134
Precompute Subexpressions and Use *++

main() {
    float a[10], b[10];
    int i;
    for (i = 0; i < 10; ++i)
        a[i] = (a[i] * 20) + b[i];
}
(19 cycles)

main() {
    float a[10], b[10];
    int i;
    register float *p = a, *q = b;
    for (i = 0; i < 10; ++i)
        *p++ = (*p * 20) + *q++;
}
(12 cycles)
Maurizio Palesi 135
Structure Assignments
The compiler generates very efficient code for structure assignments
Nest objects within structures and use simple assignments to copy them

int x1, y1, c1;
int x2, y2, c2;

x1 = x2;
y1 = y2;
c1 = c2;

struct Pixel { int x, y, c; };
struct Pixel p1, p2;

p1 = p2;
Maurizio Palesi 136
Hints for Assembly Coding
Use delayed branches
  Delayed branches execute in a single cycle
  Regular branches execute in four cycles
  The next three instructions are executed whether the branch is taken or not
  If fewer than three instructions are required, use the delayed branch and append NOPs
    A reduction in machine cycles still occurs
Maurizio Palesi 137
Hints for Assembly Coding
Apply the repeat single/block construct
  In this way, loops are achieved with no overhead
  Note that with the RPTS instruction the executed instruction is not refetched for execution
    This frees the buses for operand fetches
Maurizio Palesi 138
Hints for Assembly Coding
Use parallel instructions
Maximize the use of registers
Use the cache
Use internal memory instead of external memory
Avoid pipeline conflicts