by Dake Liu: [email protected]© Copyright of Linköping University, all rights reserved ®
9/27/2017 Unit 8 of TSEA26 – 2017 –H1 1
Design of Embedded DSP Processors
Unit 8: Firmware design and benchmarking
Contents
• Introduction to FW and its coding flow
1. Application modeling under HW constraints
2. Stream-kernel (master / slave) programming
3. Programming algorithm / computing kernels
4. Assembly code implementation
5. Code benchmarking and integration
FW design flow
Firmware
• FW is SW with fixed functions that is firmly installed in a system (firm, but not yet HW).
• FW is permanently installed in non-volatile memory and is rarely changed.
– Typical examples: baseband firmware in an SDR processor, video CODEC firmware in a TV or in a surveillance camera, …
FW coding / implementation flow
[Figure: FW coding / implementation flow. Boxes: documents, STD; high level behavior modeling; HW constraints; HW related C-modeling; code inspection; C-compiler path (source xxx.C, C-compiler, objective file xxx.bin); assembly programming path (code inspection, source xx.asm, assembler, objective file xxx.bin); object linker with LIB; simulator / debugger.]
The role of Programmer / Compiler
1. Programmer: partitions the application and assigns the parts to different instruction domains / streams, codes and debugs each domain, and integrates the heterogeneous codes.
Within an instruction stream, the programmer writes kernel codes to approach the best performance.
2. A compiler translates C to the machine language of its target and optimizes the translation.
3. API is finally added by a programming model
Understand Applications
Product: Portable audio player; DTV and video player
Application components: RTOS, Audio decoder, Voice encoder, DVB modem, Video decoder, …
Function kernels: Filter, (I)DCT, Huffman decoder, Waveform generator, (I)FFT, …
Innermost loop design
Job balancing
Task partition, allocation, scheduling
before coding / compiling
• Mostly done by hand; tools are rarely available.
– Based on computing cost prediction (code profile), algorithm features, & HW constraints
• There are different partition objectives:
– Highest performance
– Lowest power (lower speed, less communication)
– Lowest memory cost
FW Design flow
[Figure: simplified firmware design flow. Boxes: understanding applications; HW aware algorithm selections; finite length design; coding finite length firmware; high level language modeling; expose memory costs; coding FW with memory costs; run time budget; coding cycle accurate FW; re-allocatable assembly coding; binary machine code. Phases: behavior modeling, bit accurate modeling, memory accurate modeling, timing budget, assembly coding. Design entries 1, 2, and 3 are marked on the flow. Image sources: embedded.com, codehelp.co.uk]
High level FW design
Algorithm selection
• Function! Do not forget your function!
• Select algorithms for the architecture (adapt to
HW ① advanced feature and ② constraints)
• Reuse of available algorithms (SW reuse)
• Minimize computing cost (innermost loop)
• Minimize code cost (of high level codes)
• Minimize data accesses (mostly focused today)
Stream-kernel based programming
• Stream
– The main program consists of an FSM that prepares and uses subroutines
– Prolog (start a subroutine in a device)
– Epilog (finish the subroutine in the device, hand over results)
– API insertion: CUDA, OpenCL, OpenGL, OpenMP
• Interwork, task/resource management, and function call
• Kernel
– Speed up innermost loops by assembly level coding
– That is what we are going to do today!
Assembly kernel coding
Finite Length
• Finite Length
– Integer/Fractional data with limited dynamic range
– Low cost/power with acceptable quantization noise
• Technique
– Integer/fractional guard bits for iterations
– Scaling, and rounding before truncation
– Saturation instead of exception
– Block floating point, half precision floating point
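The "round before truncation" and "saturation instead of exception" techniques can be sketched in C as follows. This is only an illustrative sketch: the function name round_sat_q15, the use of a 64-bit integer to stand in for an accumulator with guard bits, and the 16-bit result width are assumptions, not details taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reduce a wide accumulator value to a 16-bit result.
 * The int64_t stands in for an accumulator with guard bits.
 * Note: >> on a negative signed value is implementation-defined in C;
 * this sketch assumes the usual arithmetic shift. */
int16_t round_sat_q15(int64_t acc, int shift)
{
    assert(shift > 0 && shift < 32);

    /* Round: add half an LSB of the result before the truncating shift. */
    int64_t rounded = (acc + ((int64_t)1 << (shift - 1))) >> shift;

    /* Saturate: clamp to the 16-bit range instead of wrapping or trapping. */
    if (rounded > INT16_MAX) return INT16_MAX;
    if (rounded < INT16_MIN) return INT16_MIN;
    return (int16_t)rounded;
}
```

A value just below half an LSB rounds down and a value at half an LSB rounds up, while an out-of-range value is clamped to the nearest 16-bit limit.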
Added quality control codes
[Figure: quality control code added around the main task flow. Main task flow blocks: A/D, filters, DSP, DEC, D/A, with scaling steps between them and scaling of coefficients / parameters. Measurement flow (MAX, AVG counters): tasks are executed only when needed. Scaling flow: tasks are executed only after running the measurement flow.]
Firmware in a fixed point processing
[Flowchart: Start; program booting and parameter initialization; loading inputs and pre-processing; main task flow (executing the kernel part algorithms); data quality control flow; post processing, result storing. The data quality control flow selects one of three branches: no operation (default), measurement flow (in case needed), or scaling flow (after measurement).]
Bit accurate behavior coding
• Fractional vs. integer:
A = 0.25 vs. 8192 = 0.25 * 32768
• Mask including guard:
A = (long)(int)A & 0x0001FFFF
• Arithmetic, for example:
yn = yn + ((long)(int)A * xn >> 15)
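The fractional mapping and the multiply-accumulate expression above can be tried out in plain C. The sketch below assumes a 16-bit fractional format with 15 fraction bits (so 0.25 maps to 8192, as on the slide); the names to_q15 and mac_q15 are hypothetical, and the guard-bit mask is left out for simplicity.

```c
#include <assert.h>
#include <stdint.h>

#define Q15_ONE 32768.0  /* scale factor: 1.0 corresponds to 32768 */

/* Hypothetical helper: convert a fraction in [-1, 1) to its integer image. */
int16_t to_q15(double v)
{
    assert(v >= -1.0 && v < 1.0);
    return (int16_t)(v * Q15_ONE);
}

/* Hypothetical helper: yn = yn + ((long)a * xn >> 15).
 * The 16x16 product has 30 fraction bits; shifting right by 15 brings it
 * back to 15 fraction bits before it is accumulated. */
int32_t mac_q15(int32_t yn, int16_t a, int16_t xn)
{
    int32_t product = (int32_t)a * (int32_t)xn;
    return yn + (product >> 15);
}
```

With a = 0.25 (8192) and xn = 0.5 (16384) the accumulated term is 4096, which is the integer image of 0.125.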
Bit accurate specification
[Figure: signal level diagram. Levels from top to bottom: HW ceiling, headroom, 0 dB, footroom, ADC resolution. Annotations: MAX gain result; scale up to avoid accumulated quantization errors.]
Measuring Data Quality
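$$D_{RMS} = \sqrt{\frac{(R_1 - r_1)^2 + (R_2 - r_2)^2 + \dots + (R_n - r_n)^2}{N}}$$
$$D_{ABSMAX} = \mathrm{MAX}\{|R_1 - r_1|, |R_2 - r_2|, \dots, |R_{n-1} - r_{n-1}|, |R_n - r_n|\}$$
$$SNR_{headroom} = 20 \log_{10} \frac{V_{MAX}}{D_{RMS}} \ \mathrm{dB}$$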
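A measurement flow that collects these quantities can be sketched with integer counters only. The sketch assumes that R is a reference result and r the finite-length result of the same computation (the slide does not define them), and the names err_stats_t, err_stats_update and err_stats_mean_sq are hypothetical. The square root and the logarithm are left to an offline tool, so the code tracks the sum of squared differences and the absolute maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical running counters for the measurement flow. */
typedef struct {
    int64_t sum_sq;   /* sum of (R_i - r_i)^2 */
    int32_t abs_max;  /* running D_ABSMAX */
    size_t  count;    /* number of samples seen */
} err_stats_t;

/* Update the counters with one reference / finite-length result pair. */
void err_stats_update(err_stats_t *s, int32_t ref, int32_t fix)
{
    assert(s != NULL);
    int32_t diff = ref - fix;
    int32_t mag = diff < 0 ? -diff : diff;

    s->sum_sq += (int64_t)diff * diff;
    if (mag > s->abs_max) s->abs_max = mag;
    s->count++;
}

/* Mean squared difference (integer division); D_RMS is its square root. */
int64_t err_stats_mean_sq(const err_stats_t *s)
{
    assert(s != NULL && s->count > 0);
    return s->sum_sq / (int64_t)s->count;
}
```

Keeping only counters in the firmware keeps the measurement flow cheap; the SNR figure can then be derived from the mean square and the maximum signal level outside the real-time path.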
Memory and memory access
• Using SPM instead of cache
– Expose flexibilities for data access
– Minimize memory cost or access cost?
• Memory hardware constraints may induce
extra execution time
– Code loading, load/store data, swapping data
when memory size is not sufficient
– Adapt your implementation to memory HW
Memory efficiency
• 1. Minimize memory costs
– Low program memory cost, low data memory cost
• 2. Minimize memory access costs
– Minimize on-chip / off-chip swapping (SPM efficiency?)
– Multiple tasks/threads sharing data
– Memory block re-connect (sharing out/in FIFO)
Memory efficient
• Select algorithms with fully predictable memory access. Much of the data can then be stored in off-chip memory and prefetched when needed.
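One common way to exploit predictable access is double buffering: while one on-chip buffer is processed, the next block is fetched into the other. The C sketch below only simulates this on a host; fetch_block is a hypothetical stand-in for a DMA transfer (here a memcpy), and the block size and the summing kernel are arbitrary assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 4  /* assumed block size in samples */

/* Hypothetical stand-in for a DMA transfer from off-chip memory to SPM. */
static void fetch_block(int16_t *spm, const int16_t *off_chip, size_t block_idx)
{
    memcpy(spm, off_chip + block_idx * BLOCK, BLOCK * sizeof(int16_t));
}

/* Sum all samples block by block using two ping-pong SPM buffers. */
int32_t sum_with_prefetch(const int16_t *off_chip, size_t n_blocks)
{
    assert(n_blocks > 0);
    int16_t spm[2][BLOCK];
    int32_t sum = 0;

    fetch_block(spm[0], off_chip, 0);  /* prolog: fetch the first block */
    for (size_t b = 0; b < n_blocks; b++) {
        /* The access pattern is known, so block b+1 can be requested
         * before block b is processed. */
        if (b + 1 < n_blocks)
            fetch_block(spm[(b + 1) & 1], off_chip, b + 1);

        for (size_t i = 0; i < BLOCK; i++)
            sum += spm[b & 1][i];
    }
    return sum;
}
```

On real hardware the fetch would run in parallel with the inner loop; the host simulation only shows the buffer bookkeeping.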
Reduce register cost
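[Figure: register lifetime chart. Variables a, b, c, d, s, t, u, v, x, y and accumulators ACR0, ACR1 plotted over cycles 1 to 8. Number of registers required per cycle: 4, 5, 6, 5, 4, 3, 2. Register labels: R0 to R5, followed by R0, R3, R1, R2.]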
Real-time Firmware Implementation
• Correct = correct result + results available in
time
• Find critical path & time constraints, WCET,
minimize memory uncertainty
Real time
• Real Time
– Cycle true: based on known cycle count
– Short distance between:
• WCET: Worst Case Execution Time
• BCET: Best Case Execution Time
– Dynamic / static run time analysis
– Quality coding of innermost loops
Code compiling
• The closer the C code is to the HW, the better the C-compiler result can be.
• Understand the compiler in detail.
• Annotate enough “Compiler known”
• Do we trust the compiler?
– Functional verification of compiled code
Low cycle cost assembly kernels
• Focus on low cycle cost of innermost loops!
– Use REPEAT instead of conditional jump
– Loop unrolling & low cycle cost scheduling!
– Do not worry much about the code cost of the inner loop!
– Use vector instructions as much as possible
– Keep useful data in the RF for as long as possible
C Algorithms for Real-Time DSP, Prentice Hall, ISBN 0133373533
Hacker's Delight, Addison-Wesley, ISBN 0201914654
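Loop unrolling, as listed above, can be illustrated at C level with a dot product unrolled by four. This is a sketch under assumptions: the function name dot_unrolled4 and the requirement that the length is a multiple of four are choices made for the example, and a real kernel would be written in the target's assembly language.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: dot product with the loop body unrolled 4 times.
 * Four independent accumulators let a scheduler overlap the multiplies,
 * and the loop overhead is paid once per four taps instead of once per tap. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 4 == 0);  /* assumption: no remainder loop in this sketch */
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

    for (size_t i = 0; i < n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

The unrolled body costs more program memory than the rolled loop, which is the trade the slide accepts for innermost loops.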
Low cycle cost assembly kernels
Rows: implementation models. Columns: application kernels, grouped as Basic, Video, Baseband, HPC.
Columns: Function, Matrix, Large matrix, Transform, Larger size T, Filter, ISP, CODEC, Post process, Coding, FEC, Decoding, Channel, Storage, FSM, Sorting, Searching
Taylor series √ √
Task partition √ √ √ √ √
Data partition √ √ √ √ √ √
Grouping √ √ √ √ √ √
Pipeline √ √
Recursive √ √ √
SPMD √ √ √ √ √ √ √ √ √ √
Master-slave √ √ √ √ √ √ √ √ √ √ √
Fork-join BSPM
Data sharing √
Reading: A Pattern Language for Parallel Programming
Kernel programming tips
• CISC (if available) vs. RISC (always there)
– RISC: Memory→RF→Computing→RF→Memory
– DSP loop: Memory → Computing → RF
• Trade off 10% - 90%, prolog, epilog, iterations
– Minimize cycle cost by acceleration / quality coding
• Amdahl’s law:
– Minimize the parts that cannot run in parallel
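Amdahl's law can be made concrete with a small calculation. In its usual form the speedup is 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized or accelerated and n is the speedup of that fraction. The function name amdahl_speedup below is a hypothetical helper for illustration.

```c
#include <assert.h>

/* Hypothetical helper: overall speedup according to Amdahl's law.
 * p: fraction of the run time that benefits from acceleration (0..1)
 * n: speedup factor of that fraction (>= 1) */
double amdahl_speedup(double p, double n)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, accelerating 90% of the run time by a factor of 10 gives an overall speedup of only about 5.3, because the remaining 10% still runs at the original speed.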
Code integration
• Oh my god! Where are the cycles consumed?
– Extra cycles are needed during SW integration
– Be sure you predicted / accounted for these cycles during the early SW plan / design phases
• Extra cost can come from (not limited to)
– Control: prolog/epilog, asynchronous handling, synchronization
– Data dependencies: loading, waiting for data to become available
– Communications: master/device (slave, I/O)
Assembly-level Release
• WCET (the worst-case execution time) should be
analyzed based on static timing analysis
– Remove paths which can never be true
– Avoid releasing code based on dynamic timing
(code simulation)
• Stack overflow should be checked if multiple
tasks are running simultaneously and associated
with many interrupts and subroutine calls
Benchmark
Benchmark
• Benchmark is a type of program to measure
the performance of a processor.
• Benchmarking is the execution of such type
of programs which allows processor users to
measure machine clock cycles consumed by
a specific section of code.
ASIP design flow
[Flowchart: source code analysis, decision for ISA of ASIP; design instruction set and toolchain for prototyping; benchmark (kernel), evaluate microarchitecture; decision "Satisfied?" (No: "Change ISA?" and iterate; Yes: continue); microarchitecture design, VLSI design, verifications.]
Third Party Benchmarks
• BDTI: Berkeley Design Technology, Inc.
– Hand written assembly by professional engineers
– http://www.bdti.com
• EEMBC (the EDN Embedded Microprocessor
Benchmark Consortium), five classes:
– automotive/industrial, consumer, networking, office
automation, and telecommunication
– http://www.eembc.org
Benchmark example: for a simple DSP
Algorithm kernels | Number of samples | Taps | Total cycle cost | Kernel cycle cost | P-Mem cost | D-mem cost
Block transfer 40 ------ 88
256 point complex FFT 256 ------ 18763
Single data sample FIR 1 16 30
Frame FIR (multi samples) 40 16 921
Complex FIR 40 16 3696
IIR biquad type I 40 40 2450
LMS Adaptive FIR 40 16 4384
16-bit division ------ ------ 67
Vector add 40 ------ 131
Vector dot 40 ------ 53
Vector Max 40 ------ 55
Floating to fixed 1 ------ 17
Fixed to floating 1 ------ 58
8X8DCT 64 ------ 874
FSM (Packet classification) 1 ------ 8
How to write a benchmark
• All operations, operands, and results use the native word length.
• Try to keep high precision in MAC.
• Round and saturate before storing data from MAC (after truncation) to memory or registers.
• All programs are implemented by experienced DSP firmware engineers.
• Complete program including loop prolog and epilog, program initialization, and wrapping up.
• All related memory access cost shall be included.
An example: FIR benchmark
• A FIR filter is a weighted sum of a finite set of inputs.
• $y(n) = \sum_{k=0}^{m-1} a_k \, x(n-k)$
• x(n) is the input
• y(n) is the output
• a_k is the vector of filter coefficients
An example: FIR benchmark
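[Figure: FIR filter structure. The input x(n) passes through a chain of delay elements T; the taps are weighted by the coefficients a0, a1, …, an and summed to give y(n).]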
An example: FIR benchmark
• Behavior level code (single sample FIR){
Reset ACR
DM(DP) <= The latest sample
DP <= DP + 1 /* Store the latest sample in the computing buffer; the same pointer then points to the oldest sample. */
For i=0 to 15 do {
ACR <= ACR + DM(DP)*TM(TP) /* 16-tap convolution for a sample */
DP <= DP + 1 /* implied modulo DP */
TP <= TP + 1 }
Round and Sat ACR; Output result;
Store the data pointer DP.
}
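The behavior level code above can be written as a small C model. This is a sketch under assumptions: the names fir_state_t and fir_sample are hypothetical, data and coefficients are taken to be 16-bit fractional values with 15 fraction bits, and a 64-bit integer stands in for the accumulator ACR.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 16

/* Hypothetical state: circular data buffer DM and data pointer DP. */
typedef struct {
    int16_t  dm[TAPS];
    unsigned dp;
} fir_state_t;

/* Filter one sample. Because the loop runs from the oldest to the newest
 * sample, tm[0] weights x(n-15) and tm[15] weights x(n). */
int16_t fir_sample(fir_state_t *s, const int16_t tm[TAPS], int16_t x)
{
    assert(s->dp < TAPS);
    int64_t acr = 0;                      /* Reset ACR */

    s->dm[s->dp] = x;                     /* DM(DP) <= the latest sample */
    s->dp = (s->dp + 1) % TAPS;           /* DP now points to the oldest sample */

    unsigned dp = s->dp;
    for (unsigned tp = 0; tp < TAPS; tp++) {
        acr += (int32_t)s->dm[dp] * tm[tp];
        dp = (dp + 1) % TAPS;             /* modulo addressing */
    }

    acr = (acr + (1 << 14)) >> 15;        /* round, then drop 15 fraction bits */
    if (acr > INT16_MAX) acr = INT16_MAX; /* saturate */
    if (acr < INT16_MIN) acr = INT16_MIN;
    return (int16_t)acr;                  /* s->dp is kept for the next call */
}
```

Such a model can serve as a bit accurate reference when the assembly version is verified.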
An example: FIR benchmark
---------------- The first part of the program -------------------
Set AP1, $SEG_FIR -- load segment (block) address to DM1pointer
Set LoopR, N -- load the loop counter
-- filter program parameters are stored in DM1
Set R15, $Resultpt -- Result pointer to R15
Set AP0, $Datapt -- data pointer to AP0
Set BTR, $Bottom -- FIFO bottom pointer
Set TPR, $Top -- FIFO top pointer
Set AP1, $Coeffpt -- coefficient pointer to AP1
----------------- The prolog consumes 7 cycles -------------------
Repeat N -- Number of samples
--for every data sample
Store DM0(AP0++), R1 -- store a data sample from R1 to DM0 (DM0 pointer)
CLR ACR1 -- Clear the accumulator buffer ACR1
An example: FIR benchmark
----------- The second part of the program
CONV ACR1 SSF 16 DM0(AP0) DM1(AP1)
-- Signed fractional convolution
-- iteration uses N+1 = 16(17) clock cycles
----------- Convolution iteration
--consumes 16 cycles if the following
--instruction does not use ACR1
An example: FIR benchmark
----------------- The third part of the program ------------------
PostOP R1, ACR1 -- Sat Round(ACR), store result in ACRH and R1
Store DM1(R15), R1 -- Store result in R1 to DM1(GRX++)
INC R15 -- position to the next result
End repeat
Store DM1(AP1++), R15 -- Store Y pointer after updating result Y
Store DM1(AP1), AP1 -- Store X pointer of the FIFO filter
----------------- The epilog consumes 6 cycles -------------------
Example: Frame sample FIR
• C-code: 40 samples filtered by a 16-tap FIR
[Figure: FIFO buffer for the FIR data. (a) The FIFO behavior: samples X(0) to X(15); new data is pushed once a FIR tap, the removed data leaves the FIFO, and each data is loaded once for signal processing of a FIR tap. (b) The FIFO implementation: the FIFO buffer occupies addresses Btm + 0 to Btm + 15 (MIN address to MAX address) between the Bottom and Top pointers in the data memory space (DM) and holds x(n-15) to x(n). States 0 to 3 show the address counter R0 moving through the buffer, with R5 and R7 marking the buffer bounds. Annotations: read a new value to replace the oldest value in the buffer, x(n-15); increase the address counter R0, so that it points to the (next) oldest value in the FIFO; replace the (next) oldest value x(n-15) with the new incoming value.]
Example: Frame sample FIR
• C-code: 16-tap FIR filter runs 40 samples
– Kernel cycle cost 17x40=680 cycles
– Prolog and epilog of inner loop: 40x5=200 cycles
– Prolog and epilog of the top loop: 9 cycles
• Typical BDTI benchmarking
Algorithm | Innermost loop prologue / epilogue | Kernel cycle cost | Total code cost | DM cost
40 sample 16-tap FIR | 5x40 = 200 | 17x40 = 680 | 889 | 65
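The cycle figures in this example add up as kernel cost plus inner loop overhead plus top loop overhead. The small C sketch below repeats that arithmetic so that other frame sizes can be budgeted the same way; the type fir_budget_t and the function frame_fir_budget are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical cycle budget for a frame based kernel. */
typedef struct {
    unsigned kernel;          /* cycles spent in the innermost loop */
    unsigned inner_overhead;  /* prolog / epilog of the inner loop */
    unsigned total;           /* kernel + inner overhead + top loop overhead */
} fir_budget_t;

fir_budget_t frame_fir_budget(unsigned samples,
                              unsigned kernel_per_sample,
                              unsigned overhead_per_sample,
                              unsigned top_overhead)
{
    assert(samples > 0);
    fir_budget_t b;
    b.kernel = samples * kernel_per_sample;
    b.inner_overhead = samples * overhead_per_sample;
    b.total = b.kernel + b.inner_overhead + top_overhead;
    return b;
}
```

Calling frame_fir_budget(40, 17, 5, 9) reproduces the numbers above: 680 kernel cycles, 200 cycles of inner loop overhead, and 889 cycles in total.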
Review on today’s discussions
• Quality firmware design is based on rich FW experience and a deep understanding of applications and HW.
• A formal design will never offer quality code.
• Firmware design can be divided into three steps:
– the algorithm selection and behavior modeling,
– the C-coding under hardware constraint,
– the assembly language coding
• Benchmark fundamentals
• Learn heterogeneous programming model in other courses
Summarize what/how to learn
[Figure: what and how to learn, arranged as concepts and skills across three stages: system understanding, FW coding, integration. Labels: firmware plan & design; skills to select algorithms; bit accurate, memory accurate, cycle accurate; assembly coding tools; further understanding tools after reading chapter 18; debug skill; verification; plan vs code; subroutines; to find extra cycle cost which you could not find out during coding.]
Self reading after the lecture
• Your hardware knowledge will help you to
design quality firmware, try to summarize it
by yourself
• Read Chapter 18 and Chapter 9
1. Collect experiences to design quality innermost
loop codes.
2. How to accelerate innermost loop in HW.
Exciting time now!
Let us discuss
• Whatever you want to discuss that is related to HW
• You will have the chance after each
lecture (Fö), do take the chance!
• Prepare your Qs for the next time
Dake Liu, Room 556 corridor B, Hus-B, phone 281256, [email protected]
You are welcome to ask any questions you want
• I can answer
• Or we can discuss together
• I want to know what you want