Design of Embedded DSP Processors - Linköping … and...Code integration • Oh my god! Where are...

by Dake Liu: [email protected]© Copyright of Linköping University, all rights reserved ®

9/27/2017 Unit 8 of TSEA26 – 2017 –H1 1

Design of Embedded DSP

Processors

Unit 8: Firmware design

and benchmarking


Contents

• Introduction to FW and its coding flow

1. Application modeling under HW constraints

2. Stream-kernel (master / slave) programming

3. Programming algorithm / computing kernels

4. Assembly code implementation

5. Code benchmarking and integration




FW design flow



Firmware

• FW is SW with fixed functions that has been firmed (but is not yet HW) in a system.

• FW is permanently installed in non-volatile memory and rarely changed.

– Typical examples: baseband firmware in an SDR processor, video CODEC firmware in a TV or in a surveillance camera, ...

FW coding / implementation flow

[Figure: FW coding / implementation flow. Boxes: documents and STD; high-level behavior modeling; HW constraints; HW-related C modeling; code inspection; C compiler; assembly programming; source xxx.C and source xx.asm; C compiler and assembler producing object files xxx.bin; object linker with LIB; simulator / debugger.]

The role of Programmer / Compiler

1. Programmer: partitions the application and assigns the parts to different instruction domains / streams, codes and debugs each domain, and integrates the heterogeneous code. Within an instruction stream, the programmer writes kernel code to approach the best performance.

2. Compiler: translates C into the machine language of its target and optimizes the translation.

3. The API is finally added by a programming model.

Understand Applications

Product: portable audio player; DTV and video player

Application components: RTOS, audio decoder, voice encoder, DVB modem, video decoder, ...

Function kernels: filter, (I)DCT, Huffman decoder, waveform generator, (I)FFT, ...

Innermost loop design

Job balancing

Task partition, allocation, scheduling

before coding / compiling

• Mostly done by hand; tools are rarely available.

– Based on computing cost prediction (code profile), algorithm features, and HW constraints

• There are different partition objectives:

– Highest performance

– Lowest power (lower speed, less communication)

– Lowest memory cost



FW Design flow

[Figure: simplified firmware design flow. Boxes: understanding applications; HW-aware algorithm selection; finite-length design; coding finite-length firmware; high-level language modeling; expose memory costs; coding FW with memory costs; run-time budget; coding cycle-accurate FW; re-allocatable assembly coding; binary machine code. Phases: behavior modeling, bit-accurate modeling, memory-accurate modeling, timing budget, assembly coding. Three design entries (1, 2, 3) are marked. Image credits: embedded.com, codehelp.co.uk]

High level FW design




Algorithm selection

• Function! Do not forget your function!

• Select algorithms for the architecture (adapt to the HW: ① advanced features and ② constraints)

• Reuse available algorithms (SW reuse)

• Minimize computing cost (innermost loop)

• Minimize code cost (of high-level code)

• Minimize data accesses (mostly the focus today)



Stream-kernel based programming

• Stream

– The main program consists of an FSM, and prepares and uses subroutines

– Prolog (start a subroutine in a device)

– Epilog (finish a subroutine in a device, hand over the results)

– API insertion: CUDA, OpenCL, OpenGL, OpenMP

• Interwork, task/resource management, and function calls

• Kernel

– Speed up innermost loops by assembly-level coding

– That is what we are going to do today!

Assembly kernel coding



Finite Length

• Finite Length

– Integer/Fractional data with limited dynamic range

– Low cost/power with acceptable quantization noise

• Technique

– Integer/fractional guard bits for iterations

– Scaling and rounding before truncation

– Saturation instead of exception

– Block floating point, half-precision floating point
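The rounding and saturation techniques listed above can be sketched in C. This is only an illustration under stated assumptions: the helper name `round_sat_q15`, the Q15/Q30 formats, and the 64-bit accumulator standing in for guard bits are choices made for this sketch, not taken from the lecture.

```c
#include <stdint.h>

/* Hypothetical helper: convert a Q30 accumulator value to a Q15 result.
 * The 64-bit type plays the role of an accumulator with guard bits.
 * Assumes an arithmetic right shift for negative values, which is what
 * common compilers provide. */
static int16_t round_sat_q15(int64_t acc_q30)
{
    /* Round: add half of the LSB that survives, then drop 15 bits. */
    int64_t r = (acc_q30 + (1LL << 14)) >> 15;

    /* Saturate instead of wrapping around or raising an exception. */
    if (r > INT16_MAX) return INT16_MAX;
    if (r < INT16_MIN) return INT16_MIN;
    return (int16_t)r;
}
```

For example, `round_sat_q15(1LL << 30)` (the value 1.0 in Q30) does not fit in Q15 and is clamped to 32767 rather than wrapping to a negative number.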




Added quality control codes

[Figure: a signal chain built from A/D, filter, DEC, DSP, and D/A blocks, with scaling inserted between the stages and scaling of coefficients and parameters. Three flows are shown:
– Main task flow
– Measurement flow (MAX, AVG, counters): tasks are executed only when needed
– Scaling flow: tasks are executed only after running the measurement flow]


Firmware in fixed-point processing

[Figure: flowchart]

1. Start
2. Program booting and parameter initialization
3. Loading inputs and pre-processing
4. Main task flow: executing the kernel part algorithms
5. Data quality control flow, one of:
– No operation (default)
– Measurement flow (in case needed)
– Scaling flow (after measurement)
6. Post processing, result storing

Bit accurate behavior coding

• Fractional vs. integer: A = 0.25 is coded as the integer 8192 = 0.25 × 32768

• Mask including guard bits: A = (long)(int)A & 0x0001FFFF

• Arithmetic, for example: yn = yn + (((long)(int)A * xn) >> 15)
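A minimal C sketch of the bit-accurate coding style above, assuming 16-bit Q15 data (scale factor 32768) and a 17-bit mask that keeps one guard bit. The helper names `to_q15`, `mask_guard`, and `mac_q15` are hypothetical, and the conversion helper does no range checking.

```c
#include <stdint.h>

#define Q15_ONE 32768   /* scale factor 2^15, as in 8192 = 0.25 * 32768 */

/* Hypothetical helper: code a real value in [-1, 1) as a Q15 integer
 * (truncates toward zero, no range check). */
static int16_t to_q15(double x)
{
    return (int16_t)(x * Q15_ONE);
}

/* Keep 16 data bits plus one guard bit, as the mask 0x0001FFFF does. */
static int32_t mask_guard(int32_t v)
{
    return v & 0x0001FFFF;
}

/* One multiply-accumulate step: yn = yn + ((A * xn) >> 15). */
static int32_t mac_q15(int32_t yn, int16_t a, int16_t xn)
{
    int32_t product = (int32_t)a * (int32_t)xn;  /* Q15 * Q15 gives Q30 */
    return yn + (product >> 15);                 /* back to Q15, then accumulate */
}
```

With A = `to_q15(0.25)` and xn = 16384 (0.5 in Q15), `mac_q15(0, A, 16384)` gives 4096, which is 0.125 in Q15.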




Bit accurate specification

[Figure: signal level diagram. Levels shown: HW ceiling, headroom, 0 dB, foot-room, ADC resolution. Labels: MAX, gain, result. Annotation: scale up to avoid accumulated quantization errors.]

Measuring Data Quality

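$$D_{RMS} = \sqrt{\frac{(R_1 - r_1)^2 + (R_2 - r_2)^2 + \dots + (R_n - r_n)^2}{N}}$$

$$D_{ABSMAX} = \mathrm{MAX}\{|R_1 - r_1|, |R_2 - r_2|, \dots, |R_{n-1} - r_{n-1}|, |R_n - r_n|\}$$

$$\mathrm{SNR} = 20 \log_{10}\frac{\mathrm{MAX}}{D_{RMS}} - V_{headroom} \ \mathrm{dB}$$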



Memory and memory access

• Using SPM (scratchpad memory) instead of cache

– Exposes flexibility for data access

– Minimize memory cost or access cost?

• Memory hardware constraints may induce extra execution time

– Code loading, data load/store, and data swapping when the memory size is not sufficient

– Adapt your implementation to the memory HW

Memory efficiency

• 1. Minimize memory costs

– Low program memory cost, low data memory cost

• 2. Minimize memory access costs

– Minimize on-chip / off-chip swapping (SPM efficiency?)

– Multiple tasks/threads sharing data

– Memory block re-connect (sharing out/in FIFO)

Memory efficient

• Select algorithms with fully predictable memory access. Much of the data can then be stored in off-chip memory and prefetched when needed.

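Reduce register cost

[Figure: register lifetime chart over cycles 1 to 8 for the variables a, b, c, d, s, t, u, v, x, y and the accumulators ACR0 and ACR1. Number of registers required per cycle: 4, 5, 6, 5, 4, 3, 2. The variables are mapped to registers R0 to R5, with R0, R3, R1, and R2 appearing a second time.]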

Real-time Firmware Implementation

• Correct = correct result + result available in time

• Find the critical path and the time constraints, analyze the WCET, minimize memory uncertainty

Real time

• Real time

– Cycle true: based on known cycle count

– Short distance between

• WCET: Worst Case Execution Time

• BCET: Best Case Execution Time

– Dynamic / static run time analysis

– Quality coding of innermost loops

Code compiling

• The closer the C code is to the HW, the better the C-compiler result can be

• Understand the compiler in detail.

• Annotate the code with enough “compiler known” information

• Do we trust the compiler?

– Functional verification of the compiled code

Low cycle cost assembly kernels

• Focus on the low cycle cost of innermost loops!

– Use REPEAT instead of a conditional jump

– Loop unrolling and low cycle cost scheduling!

– Do not care much about the code cost of the inner loop!

– Use as many vector instructions as possible

– Keep useful data in the RF as long as possible
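Loop unrolling, one of the techniques in the list above, can be illustrated in C. The sketch below unrolls a dot product by four with independent accumulators, which trades code size for fewer loop-control cycles and keeps the partial sums in registers. The function name and the requirement that the length is a multiple of four are assumptions of this sketch; a hardware REPEAT instruction has no direct C equivalent and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product unrolled by four. Assumption of this sketch: n is a
 * multiple of 4, so no clean-up loop is needed. */
static int64_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 4 == 0);

    /* Four independent accumulators remove the dependency between
     * consecutive multiply-accumulate operations. */
    int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

    for (size_t i = 0; i < n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```

The loop branch is now taken once per four products instead of once per product, at the price of a longer loop body.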

C Algorithms for Real-Time DSP, Prentice Hall, ISBN 0133373533

Hacker's Delight, Addison-Wesley, ISBN 0201914654



Low cycle cost assembly kernels

[Table: implementation models (rows) versus functions (columns). Function groups: Basic, Video, Baseband, HPC. Functions: matrix, large matrix, transform, larger-size T, filter, ISP, CODEC, post-process, coding, FEC, decoding, channel, storage, FSM, sorting, searching.]

Taylor series √ √

Task partition √ √ √ √ √

Data partition √ √ √ √ √ √

Grouping √ √ √ √ √ √

Pipeline √ √

Recursive √ √ √

SPMD √ √ √ √ √ √ √ √ √ √

Master-slave √ √ √ √ √ √ √ √ √ √ √

Fork-join BSPM

Data sharing √

Reading: A Pattern Language for Parallel Programming



Kernel programming tips

• CISC (if available) vs. RISC (always there)

– RISC: Memory → RF → Computing → RF → Memory

– DSP loop: Memory → Computing → RF

• Trade-off 10% - 90%: prolog, epilog, iterations

– Minimize cycle cost by acceleration / quality coding

• Amdahl’s law:

– Minimize the parts that cannot run in parallel
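Amdahl's law can be made concrete with a few lines of C. In this sketch, `p` is the fraction of the run time that can be accelerated and `s` is the speedup of that fraction; the function name and the example numbers are illustrative assumptions, not figures from the lecture.

```c
/* Amdahl's law: overall speedup = 1 / ((1 - p) + p / s),
 * where p is the accelerated fraction of the original run time
 * and s is the speedup applied to that fraction. */
static double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

If a kernel takes 90% of the cycles and is made 10 times faster, `amdahl_speedup(0.9, 10.0)` is only about 5.3, because the remaining 10% (for example prolog and epilog) is untouched. This is why the non-parallel, non-accelerated part has to be kept small.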


Code integration

• Oh my god! Where are the cycles consumed?

– Extra cycles are needed during SW integration

– Be sure you predicted / accounted for these cycles during the early SW plan / design phases

• Extra cost can come from (not limited to)

– Control: prolog/epilog, asynchronous events, synchronization

– Data dependencies: loading, waiting for data to become available

– Communications: master/device (slave, I/O)

Assembly-level Release

• WCET (the worst-case execution time) should be

analyzed based on static timing analysis

– Remove paths which can never be true

– Avoid releasing code based on dynamic timing

(code simulation)

• Stack overflow should be checked if multiple

tasks are running simultaneously and associated

with many interrupts and subroutine calls




Benchmark




Benchmark

• A benchmark is a program that measures the performance of a processor.

• Benchmarking is the execution of such programs, which allows processor users to measure the machine clock cycles consumed by a specific section of code.

ASIP design flow

1. Source code analysis; decision on the ISA of the ASIP

2. Design the instruction set and the toolchain for prototyping

3. Benchmark (kernels) and evaluate the microarchitecture

4. Microarchitecture design, VLSI design, verification

Decision points in the flow: "Satisfied?" and "Change ISA?" (Yes / No)

Third Party Benchmarks

• BDTI: Berkeley Design Technology, Inc.

– Hand written assembly by professional engineers

– http://www.bdti.com

• EEMBC (the EDN Embedded Microprocessor

Benchmark Consortium), five classes:

– automotive/industrial, consumer, networking, office

automation, and telecommunication

– http://www.eembc.org




Benchmark example: for a simple DSP

Algorithm kernel | Number of samples | Taps | Total cycle cost | Kernel cycle cost | P-Mem cost | D-Mem cost
Block transfer | 40 | ------ | 88 | | |
256-point complex FFT | 256 | ------ | 18763 | | |
Single data sample FIR | 1 | 16 | 30 | | |
Frame FIR (multi samples) | 40 | 16 | 921 | | |
Complex FIR | 40 | 16 | 3696 | | |
IIR biquad type I | 40 | 40 | 2450 | | |
LMS adaptive FIR | 40 | 16 | 4384 | | |
16-bit division | ------ | ------ | 67 | | |
Vector add | 40 | ------ | 131 | | |
Vector dot | 40 | ------ | 53 | | |
Vector max | 40 | ------ | 55 | | |
Floating to fixed | 1 | ------ | 17 | | |
Fixed to floating | 1 | ------ | 58 | | |
8x8 DCT | 64 | ------ | 874 | | |
FSM (packet classification) | 1 | ------ | 8 | | |

How to write a benchmark

• All operations, operands, and results are of native length.

• Try to keep high precision in the MAC.

• Round and saturate before storing data from the MAC (after truncation) to memory or registers.

• All programs are implemented by experienced DSP firmware engineers.

• The program is complete, including loop prolog and epilog, program initialization, and wrapping up.

• All related memory access costs shall be included.

An example: FIR benchmark

• A FIR filter is a weighted sum of a finite set of inputs:

$$y(n) = \sum_{k=0}^{m-1} a_k \, x(n-k)$$

• x(n) is the input

• y(n) is the output

• a_k is the vector of filter coefficients
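The FIR definition, a weighted sum of the m most recent inputs, maps directly to a short loop. The following C sketch uses floating point for readability and treats samples before x(0) as zero; both choices, and the function name `fir_direct`, are assumptions of this illustration rather than part of the benchmark.

```c
#include <stddef.h>

/* Direct evaluation of y(n) = sum over k = 0 .. m-1 of a[k] * x(n - k).
 * Samples with a negative index (before the start of x) count as zero. */
static double fir_direct(const double *a, size_t m, const double *x, size_t n)
{
    double y = 0.0;
    for (size_t k = 0; k < m && k <= n; k++) {
        y += a[k] * x[n - k];
    }
    return y;
}
```

With coefficients {0.5, 0.25, 0.25} and input {1, 2, 3, 4}, the output at n = 3 is 0.5·4 + 0.25·3 + 0.25·2 = 3.25.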



[Figure: FIR filter structure. The input x(n) passes through a chain of delay elements T; the delayed samples are weighted by the coefficients a0, a1, ..., an and added to form y(n).]


• Behavior level code (single sample FIR)

{

Reset ACR

DM(DP) <= The latest sample /* store the latest sample in the computing buffer */

DP <= DP + 1 /* the same pointer now addresses the oldest sample */

For i=0 to 15 do {

ACR <= ACR + DM(DP)*TM(TP) /* 16-tap convolution for a sample */

DP <= DP + 1 /* implied modulo DP */

TP <= TP + 1

}

Round and Sat ACR; Output result

Store the data pointer DP.

}
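The behavior-level code above can be turned into a runnable C model. This is a sketch under several assumptions that are not in the slides: 16-bit Q15 samples and coefficients, a 64-bit accumulator standing in for ACR with guard bits, an arithmetic right shift for negative values, and the names `fir_state` and `fir_sample`. Because the loop starts at the oldest sample, the coefficient table `tm` is assumed to be stored in reversed order (tm[0] weights the oldest sample, tm[15] the newest).

```c
#include <stdint.h>

#define TAPS 16

typedef struct {
    int16_t  dm[TAPS];  /* circular sample buffer, the role of DM */
    unsigned dp;        /* data pointer DP, kept between calls */
} fir_state;

/* Process one input sample and return one rounded, saturated output. */
static int16_t fir_sample(fir_state *s, const int16_t tm[TAPS], int16_t sample)
{
    int64_t acr = 0;                        /* Reset ACR */

    s->dm[s->dp] = sample;                  /* latest sample replaces the oldest */
    s->dp = (s->dp + 1) % TAPS;             /* DP now addresses the oldest sample */

    for (unsigned tp = 0; tp < TAPS; tp++) {
        acr += (int32_t)s->dm[s->dp] * tm[tp];
        s->dp = (s->dp + 1) % TAPS;         /* modulo addressing of DP */
    }

    acr = (acr + (1 << 14)) >> 15;          /* round, Q30 back to Q15 */
    if (acr > INT16_MAX) acr = INT16_MAX;   /* saturate */
    if (acr < INT16_MIN) acr = INT16_MIN;
    return (int16_t)acr;                    /* DP is already stored in *s */
}
```

After the 16 iterations the pointer has wrapped around once, so it again addresses the oldest sample, which is exactly the position the next call overwrites.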





---------------- The first part of the program -------------------

Set AP1, $SEG_FIR -- load segment (block) address to DM1pointer

Set LoopR, N -- load the loop counter

-- filter program parameters are stored in DM1

Set R15, $Resultpt -- Result pointer to R15

Set AP0, $Datapt -- data pointer to AP0

Set BTR, $Bottom -- FIFO bottom pointer

Set TPR, $Top -- FIFO top pointer

Set AP1, $Coeffpt -- coefficient pointer to AP1

----------------- The prolog consumes 7 cycles -------------------

Repeat N -- Number of samples

--for every data sample

Store DM0(AP0++), R1 -- a sample data from R1 to DM0(DM0pointer)

CLR ACR1 -- Clear the accumulator buffer ACR1


----------- The second part of the program

CONV ACR1 SSF 16 DM0(AP0) DM1(AP1)

-- Signed fractional convolution

-- iteration uses N+1 = 16(17) clock cycles

----------- Convolution iteration

--consumes 16 cycles if the following

--instruction does not use ACR1





----------------- The third part of the program ------------------

PostOP R1, ACR1 -- Sat Round(ACR), store result in ACRH and R1

Store DM1(R15), R1 -- Store result in R1 to DM1(GRX++)

INC R15 -- position to the next result

End repeat

Store DM1(AP1++), R15 -- Store Y pointer after updating result Y

Store DM1(AP1), AP1 -- Store X pointer of the FIFO filter

----------------- The epilog consumes 6 cycles -------------------




Example: Frame sample FIR

• C-code: 40 samples filtered by a 16-tap FIR

[Figure, two panels.
(a) The FIFO behavior: the buffer holds X(0) to X(15); new data is pushed once per FIR tap and the removed data leaves at the other end; each data item is loaded once for the signal processing of a FIR tap.
(b) The FIFO implementation: a circular FIFO buffer in the data memory space (DM) between Bottom (MIN address, Btm + 0) and Top (MAX address, Btm + 15), holding X(n-15) to X(n). States 0 to 3 show the address counter R0 moving through the buffer (registers R5 and R7 are also shown): read a new value to replace the oldest value in the buffer, x(n-15); increase the address counter R0 so that it points to the (next) oldest value in the FIFO; replace the (next) oldest value x(n-15) with the new incoming value, and so on.]

Example: Frame sample FIR

• C-code: 16-tap FIR filter runs 40 samples

– Kernel cycle cost: 17 x 40 = 680 cycles

– Prolog and epilog of the inner loop: 40 x 5 = 200 cycles

– Prolog and epilog of the top loop: 9 cycles

• Typical BDTI benchmarking

Algorithm | Innermost loop prolog/epilog | Kernel cycle cost | Total code cost | DM cost
40-sample 16-tap FIR | 5 x 40 = 200 | 17 x 40 = 680 | 889 | 65
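The cycle figures above add up as follows: each of the 40 samples costs 17 kernel cycles plus 5 cycles of inner-loop prolog and epilog, and the top loop adds 9 cycles once. A small C helper makes the arithmetic explicit; the function and parameter names are hypothetical.

```c
/* Total cycle count of a frame-based kernel:
 * samples * (kernel cycles + inner-loop overhead) + top-loop overhead. */
static unsigned frame_cycles(unsigned samples,
                             unsigned kernel_cycles,
                             unsigned inner_overhead,
                             unsigned outer_overhead)
{
    return samples * (kernel_cycles + inner_overhead) + outer_overhead;
}
```

`frame_cycles(40, 17, 5, 9)` gives 680 + 200 + 9 = 889 cycles, so roughly a quarter of the budget goes to loop overhead rather than to the convolution itself.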




Review on today’s discussions

• Quality firmware design is based on rich FW experience and a deep understanding of the applications and the HW.

• A formal design will never offer quality code.

• Firmware design can be divided into three steps:

– algorithm selection and behavior modeling,

– C coding under hardware constraints,

– assembly language coding

• Benchmark fundamentals

• Learn the heterogeneous programming model in other courses



Summarize what/how to learn

[Figure: concepts and skills to learn.
– System understanding: firmware plan & design, skills to select algorithms
– FW coding: bit accurate, memory accurate, cycle accurate
– Integration: plan vs code, to find the extra cycle cost that you could not find while coding subroutines
– Assembly coding tools; further understanding of the tools after reading Chapter 18
– Debug skill, verification]



Self reading after the lecture

• Your hardware knowledge will help you design quality firmware; try to summarize it yourself

• Read Chapter 18 and Chapter 9

1. Collect experience in designing quality innermost loop code.

2. How to accelerate the innermost loop in HW.



Exciting time now!

Let us discuss

• Whatever you want to discuss that is related to HW

• You will have the chance after each lecture (Fö), do take the chance!

• Prepare your Qs for the next time



Dake Liu, Room 556 corridor B, Hus-B, phone 281256, [email protected]

You are welcome to ask any questions you want:

• I can answer

• Or we can discuss together

• I want to know what you want

