Post on 13-Feb-2017
Cortex-M4 Architecture and ASM Programming
Introduction
In this chapter programming the Cortex-M4 in assembly and C will be introduced. Preference will be given to explaining code development for the Cypress FM4 S6E2CC, STM32F4 Discovery, and LPC4088 Quick Start. The basis for the material presented in this chapter is the course notes from the ARM LiB program (footnote 1).
Overview
• Cortex-M4 Memory Map
– Cortex-M4 Memory Map
– Bit-band Operations
– Cortex-M4 Program Image and Endianness
• ARM Cortex-M4 Processor Instruction Set
– ARM and Thumb Instruction Set
– Cortex-M4 Instruction Set
1. LiB Low-level Embedded NXP LPC4088 Quick Start
Chapter 3 • Cortex-M4 Architecture and ASM Programming
Cortex-M4 Memory Map
• The Cortex-M4 processor has 4 GB of memory address space
– Support for bit-band operation (detailed later)
• The 4 GB memory space is architecturally divided into a number of regions
– Each region has a recommended usage
– This makes it easy for software programmers to port code between different devices
• Despite the default memory map, the actual usage of the memory map can be flexibly defined by the user, except for some fixed memory addresses, such as the internal private peripheral bus
3–2 ECE 5655/4655 Real-Time DSP
M4 Memory Map (cont.)
[Figure: default Cortex-M4 memory map, summarized in the table below]
Region | Address range | Size | Main use
Vendor specific memory | 0xE0100000 – 0xFFFFFFFF | part of top 512MB | Reserved for other purposes
Private Peripheral Bus (PPB) | 0xE0000000 – 0xE00FFFFF | part of top 512MB | Private peripherals, e.g. NVIC, SCS
External device | 0xA0000000 – 0xDFFFFFFF | 1GB | Mainly used for external peripherals, e.g. SD card
External RAM | 0x60000000 – 0x9FFFFFFF | 1GB | Mainly used for external memories, e.g. external DDR, FLASH, LCD
Peripherals | 0x40000000 – 0x5FFFFFFF | 512MB | Mainly used for on-chip peripherals, e.g. AHB, APB peripherals
SRAM | 0x20000000 – 0x3FFFFFFF | 512MB | Mainly used for data memory, e.g. on-chip SRAM, SDRAM
Code | 0x00000000 – 0x1FFFFFFF | 512MB | Mainly used for program code, e.g. on-chip FLASH
PPB contents shown in the figure:
– Internal PPB: Instrumentation trace macrocell, Data watchpoint and trace unit, Fetch patch and breakpoint unit, Reserved, System Control Space (including the Nested Vectored Interrupt Controller, NVIC), Reserved
– External PPB: Trace port interface unit, Embedded trace macrocell, External PPB, ROM table
M4 Memory Map (cont.)
• Code Region
– Primarily used to store program code
– Can also be used for data memory
– On-chip memory, such as on-chip FLASH
• SRAM Region
– Primarily used to store data, such as heaps and stacks
– Can also be used for program code
– On-chip memory; despite its name “SRAM”, the actual device could be SRAM, SDRAM or other types
• Peripheral Region
– Primarily used for peripherals, such as Advanced High-performance Bus (AHB) or Advanced Peripheral Bus (APB) peripherals
• External RAM Region
– Primarily used to store large data blocks, or memory caches
– Off-chip memory, slower than on-chip SRAM region
• External Device Region
– Primarily used to map to external devices
– Off-chip devices, such as SD card
• Internal Private Peripheral Bus (PPB)
– Used inside the processor core for internal control
– Within PPB, a special range of memory is defined as System Control Space (SCS)
– The Nested Vectored Interrupt Controller (NVIC) is part of SCS
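The region boundaries of the default memory map can be captured in a few comparisons. The following is a minimal sketch in C, not vendor code: the enum names and the helper classify_address are hypothetical, and the address limits are the ones from the default memory map figure. A real device may populate only part of each region.

```c
#include <stdint.h>

/* Hypothetical names for the default Cortex-M4 memory map regions. */
typedef enum {
    REGION_CODE,
    REGION_SRAM,
    REGION_PERIPHERAL,
    REGION_EXTERNAL_RAM,
    REGION_EXTERNAL_DEVICE,
    REGION_PPB_OR_VENDOR      /* top 512MB: PPB plus vendor specific memory */
} mem_region_t;

/* Classify a 32-bit address by the default region it falls in. */
mem_region_t classify_address(uint32_t addr)
{
    if (addr <= 0x1FFFFFFFu) return REGION_CODE;            /* 512MB */
    if (addr <= 0x3FFFFFFFu) return REGION_SRAM;            /* 512MB */
    if (addr <= 0x5FFFFFFFu) return REGION_PERIPHERAL;      /* 512MB */
    if (addr <= 0x9FFFFFFFu) return REGION_EXTERNAL_RAM;    /* 1GB   */
    if (addr <= 0xDFFFFFFFu) return REGION_EXTERNAL_DEVICE; /* 1GB   */
    return REGION_PPB_OR_VENDOR;                            /* 512MB */
}
```

For instance, classify_address(0x20000000) falls in the SRAM region, while 0xE0000000 is the first address of the PPB.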
Cortex-M4 Memory Map Example
[Figure: example chip. On the chip silicon, the Cortex-M4 (with its PPB, SCS, NVIC and debug control) connects over the AHB bus to on-chip FLASH (Code Region), on-chip SRAM (SRAM Region), the Peripheral Region (Timer, UART, GPIO), an external memory interface (External RAM Region) and an external device interface (External Device Region). Off chip, these interfaces connect to external SRAM, FLASH, an external LCD and an SD card]
Bit-band Operations
• Bit-band operation allows a single load/store operation to access a single bit in the memory, for example, to change a single bit of one 32-bit data:
– Normal operation without bit-band (read-modify-write)
* Read the value of 32-bit data
* Modify a single bit of the 32-bit value (keep other bits unchanged)
* Write the value back to the address
– Bit-band operation
* Directly write a single bit (0 or 1) to the “bit-band alias address” of the data
• Bit-band alias address
– Each bit-band alias address is mapped to a real data address
– When writing to the bit-band alias address, only a single bit of the data will be changed
Bit-band Operation Example
• For example, in order to set bit[3] in word data in address 0x20000000:
• Read-Modify-Write operation
– Read the real data address (0x20000000)
– Modify the desired bit (retain other bits unchanged)
– Write the modified data back
• Bit-band operation
– Directly set the bit by writing ‘1’ to address 0x2200000C, which is the alias address of the fourth bit of the 32-bit data at 0x20000000
– In effect, this single instruction is mapped to 2 bus transfers: read data from 0x20000000 to the buffer, and then write to 0x20000000 from the buffer with bit [3] set
;Read-Modify-Write Operation
    LDR   R1, =0x20000000 ;Setup address
    LDR   R0, [R1]        ;Read
    ORR.W R0, #0x8        ;Modify bit
    STR   R0, [R1]        ;Write back
;Bit-band Operation
    LDR   R1, =0x2200000C ;Setup address
    MOV   R0, #1          ;Load data
    STR   R0, [R1]        ;Write
Bit-band Alias Address
Each bit of the 32-bit data is one-to-one mapped to the bit-band alias address
– For example, the fourth bit (bit [3]) of the data at 0x20000000 is mapped to the bit-band alias address at 0x2200000C
– Hence, to set bit [3] of the data at 0x20000000, we only need to write ‘1’ to address 0x2200000C
– In Cortex-M4, there are two pre-defined bit-band alias regions: one for SRAM region, and one for peripherals region
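The one-to-one mapping follows a simple rule: each bit of the bit-band region occupies one 32-bit word in the alias region, so the alias address is the alias base plus 32 times the byte offset plus 4 times the bit number. The C sketch below only computes addresses and does not touch memory; the macro and function names are hypothetical, chosen for illustration.

```c
#include <stdint.h>

/* Base addresses of the two pre-defined bit-band regions and their aliases. */
#define SRAM_BITBAND_BASE    0x20000000u
#define SRAM_ALIAS_BASE      0x22000000u
#define PERIPH_BITBAND_BASE  0x40000000u
#define PERIPH_ALIAS_BASE    0x42000000u

/* Compute the alias address of bit `bit` (0..7 per byte, or 0..31 when
 * byte_addr is word aligned) of the data at `byte_addr`. */
uint32_t bitband_alias(uint32_t region_base, uint32_t alias_base,
                       uint32_t byte_addr, uint32_t bit)
{
    uint32_t byte_offset = byte_addr - region_base;
    return alias_base + byte_offset * 32u + bit * 4u;
}
```

With these definitions, bit [3] of the word at 0x20000000 maps to 0x2200000C, matching the example above, and bit [0] of the word at 0x20000004 maps to 0x22000080. On real hardware the returned value would be cast to a volatile uint32_t pointer before writing 0 or 1 to it.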
[Figure: real 32-bit data addresses 0x20000000, 0x20000004 and 0x20000008 and the bit-band alias addresses they map to, e.g. 0x22000000, 0x2200000C, 0x22000018, 0x22000080 and 0x22000100]
Bit-band Alias Address (cont.)
• SRAM region
– 32MB memory space (0x22000000 – 0x23FFFFFF) is used as the bit-band alias region for 1MB data (0x20000000 – 0x200FFFFF)
• Peripherals region
– 32MB memory space (0x42000000 – 0x43FFFFFF) is used as the bit-band alias region for 1MB data (0x40000000 – 0x400FFFFF)
[Figure: location of the bit-band regions within the Code, SRAM, Peripherals and External RAM regions of the memory map. SRAM: 1MB bit-band region starting at 0x20000000, 31MB non-bit-band region from 0x20100000 to 0x21FFFFFF, 32MB bit-band alias from 0x22000000 to 0x23FFFFFF. Peripherals: 1MB bit-band region starting at 0x40000000, 31MB non-bit-band region from 0x40100000 to 0x41FFFFFF, 32MB bit-band alias from 0x42000000 to 0x43FFFFFF]
Benefits of Bit-Band Operations
• Faster bit operations
• Fewer instructions
• Atomic operation, avoid hazards
– For example, if an interrupt is triggered and served during the Read-Modify-Write operations, and the interrupt service routine modifies the same data, a data conflict will occur
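The hazard can be reproduced on a host machine by calling the "ISR" by hand between the read and the write of a read-modify-write sequence. This is only a simulation sketch: the names are hypothetical, and the choice of bit 0 for the main program and bit 1 for the ISR is an assumption made for illustration.

```c
#include <stdint.h>

/* Stands in for the shared 32-bit data word. */
static uint32_t shared_word;

/* Simulated interrupt service routine: sets bit 1 of the shared word. */
void isr_set_bit1(void)
{
    shared_word |= (1u << 1);
}

/* Main program sets bit 0 with a read-modify-write that is interrupted
 * between the read and the write back. */
uint32_t rmw_interrupted(uint32_t initial)
{
    shared_word = initial;
    uint32_t tmp = shared_word;   /* Read                                  */
    isr_set_bit1();               /* Interrupt served here                 */
    tmp |= (1u << 0);             /* Modify bit 0 of the stale copy        */
    shared_word = tmp;            /* Write back: the ISR's bit 1 is lost   */
    return shared_word;
}

/* Same updates, but the main program's update is not interrupted. */
uint32_t rmw_uninterrupted(uint32_t initial)
{
    shared_word = initial;
    isr_set_bit1();               /* ISR runs before the sequence starts   */
    shared_word |= (1u << 0);     /* Read, modify and write on fresh data  */
    return shared_word;
}
```

Starting from 0, the interrupted sequence ends with only bit 0 set, while the uninterrupted one ends with both bits set. A bit-band write avoids the problem because the main program never holds a stale copy of the whole word.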
[Figure: timeline of the hazard. The main program reads the data at 0x00 and modifies a bit; an interrupt occurs and the interrupt service routine reads the data at 0x00, modifies bit [1] and writes the data back; after the interrupt returns, the main program writes its data back, so bit [1] modified by the ISR is overwritten by the main program]
Cortex-M4 Program Image
• The program image in Cortex-M4 contains
– Vector table – includes the starting addresses of exceptions (vectors) and the initial value of the main stack pointer (MSP);
– C start-up routine;
– Program code – application code and data;
– C library code – program code for C library functions
[Figure: program image in the code region, starting at 0x00000000. The vector table comes first, followed by the start-up routine, program code and C library code. Vector table entries, in order: Initial MSP value, Reset vector, NMI vector, Hard fault vector, MemManage fault, Bus fault, Usage fault, Reserved, SVCall, Debug monitor, Reserved, PendSV, SysTick, External Interrupts]
Cortex-M4 Program Image (cont)
• After Reset, the processor:
– First reads the initial MSP value;
– Then reads the reset vector;
– Branches to the start of the program execution address (reset handler);
– Subsequently executes program instructions
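The reset steps above can be mimicked on a host machine by treating the start of the program image as an array of 32-bit words. This is a sketch under assumptions: the struct and function names are hypothetical, and the image contents in the usage note are made-up values. It also relies on one fact not stated above, namely that bit 0 of the reset vector is set to indicate Thumb state, so the first instruction is fetched from the vector value with bit 0 cleared.

```c
#include <stdint.h>

/* What the processor has loaded after the first two reset fetches. */
typedef struct {
    uint32_t msp;            /* initial main stack pointer value  */
    uint32_t reset_vector;   /* raw value of the reset vector     */
} reset_state_t;

/* Simulate the first two fetches from a program image whose first
 * word sits at address 0x00000000. */
reset_state_t simulate_reset(const uint32_t *image)
{
    reset_state_t s;
    s.msp          = image[0];   /* read address 0x00000000 */
    s.reset_vector = image[1];   /* read address 0x00000004 */
    return s;
}

/* Address of the first instruction: reset vector with the Thumb bit cleared. */
uint32_t first_instruction_address(uint32_t reset_vector)
{
    return reset_vector & ~1u;
}
```

For a hypothetical image beginning with the words 0x20001000 and 0x00000101, the stack pointer starts at 0x20001000 and execution begins at 0x00000100.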
[Figure: reset sequence. Reset, then fetch initial value for MSP (read address 0x00000000), then fetch reset vector (read address 0x00000004), then fetch 1st instruction (read address of reset vector), then fetch 2nd instruction (read subsequent instructions)]
Cortex-M4 Endianness
• Endian refers to the order of bytes stored in memory
– Little endian: lowest byte of a word-size data is stored in bit 0 to bit 7
– Big endian: lowest byte of a word-size data is stored in bit 24 to bit 31
• Cortex-M4 supports both little endian and big endian
• However, endianness exists only at the hardware level
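Seen from byte addresses rather than bit lanes, little endian places the least significant byte of a word at the lowest address and big endian places the most significant byte there. The portable C sketch below stores a word into a byte array both ways; the function names are hypothetical.

```c
#include <stdint.h>

/* Store a 32-bit word in little endian order:
 * least significant byte at the lowest address. */
void store_word_le(uint8_t *mem, uint32_t w)
{
    mem[0] = (uint8_t)(w & 0xFFu);
    mem[1] = (uint8_t)((w >> 8) & 0xFFu);
    mem[2] = (uint8_t)((w >> 16) & 0xFFu);
    mem[3] = (uint8_t)((w >> 24) & 0xFFu);
}

/* Store a 32-bit word in big endian order:
 * most significant byte at the lowest address. */
void store_word_be(uint8_t *mem, uint32_t w)
{
    mem[0] = (uint8_t)((w >> 24) & 0xFFu);
    mem[1] = (uint8_t)((w >> 16) & 0xFFu);
    mem[2] = (uint8_t)((w >> 8) & 0xFFu);
    mem[3] = (uint8_t)(w & 0xFFu);
}

/* Returns 1 if the machine running this code is little endian. */
int host_is_little_endian(void)
{
    uint32_t probe = 1u;
    return *(const uint8_t *)&probe == 1u;
}
```

Storing 0x11223344 with store_word_le gives the bytes 0x44, 0x33, 0x22, 0x11 at increasing addresses; store_word_be gives 0x11, 0x22, 0x33, 0x44.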
[Figure: layout of Byte0 to Byte3 across bit lanes [7:0], [15:8], [23:16] and [31:24] for Word 1, Word 2 and Word 3 (addresses 0x00000000, 0x00000004 and 0x00000008), shown for big endian 32-bit memory and for little endian 32-bit memory]
ARM and Thumb® Instruction Set
• Early ARM instruction set
– 32-bit instruction set, called the ARM instructions
– Powerful and good performance
– Larger program memory compared to 8-bit and 16-bit processors
– Larger power consumption
• Thumb-1 instruction set
– 16-bit instruction set, first used in ARM7TDMI processor in 1995
– Provides a subset of the ARM instructions, giving better code density compared to 32-bit RISC architecture
– Code size is reduced by ~30%, but performance is also reduced by ~20%
ARM and Thumb Instruction Set (cont.)
• Mix of ARM and Thumb-1 Instruction sets
– Benefit from both 32-bit ARM (high performance) and 16-bit Thumb-1 (high code density)
– A multiplexer is used to switch between two states: ARM state (32-bit) and Thumb state (16-bit), which requires a switching overhead
• Thumb-2 instruction set
– Consists of both 32-bit Thumb instructions and original 16-bit Thumb-1 instruction sets
– Compared to 32-bit ARM instructions set, code size is reduced by ~26%, while keeping a similar performance
– Capable of handling all processing requirements in one operation state
[Figure: incoming instructions reach a multiplexer either directly or through a Thumb-to-ARM remap stage; the T bit selects the path (0: select ARM, 1: select Thumb), and the selected instruction goes to the ARM instruction decoder and on to execution]
Cortex-M4 Instruction Set
• Cortex-M4 processor
– ARMv7-M architecture
– Supports 32-bit Thumb-2 instructions
– Possible to handle all processing requirements in one operation state (Thumb state)
– Compared with traditional ARM processors (use state switching), advantages include:
* No state switching overhead – both execution time and instruction space are saved
* No need to separate ARM code and Thumb code source files, which makes the development and maintenance of software easier
* Easier to get optimized efficiency and performance
Cortex-M4 Instruction Set (cont.)
• ARM assembly syntax:
label
    mnemonic operand1, operand2, … ; Comments
– Label is used as a reference to an address location;
– Mnemonic is the name of the instruction;
– Operand1 is the destination of the operation;
– Operand2 is normally the source of the operation;
– Comments are written after “ ; ”, which does not affect the program;
– For example
    MOVS R3, #0x11 ; Set register R3 to 0x11
– Note that the assembly code can be assembled by either ARM assembler (armasm) or assembly tools from a variety of vendors (e.g. GNU tool chain). When using GNU tool chain, the syntax for labels and comments is slightly different
Cortex-M4 Instruction Set Tables
Mnemonic | Operands | Brief description | Flags
ADC, ADCS | {Rd,} Rn, Op2 | Add with Carry | N,Z,C,V
ADD, ADDS | {Rd,} Rn, Op2 | Add | N,Z,C,V
ADD, ADDW | {Rd,} Rn, #imm12 | Add | N,Z,C,V
ADR | Rd, label | Load PC-relative Address | -
AND, ANDS | {Rd,} Rn, Op2 | Logical AND | N,Z,C
ASR, ASRS | Rd, Rm, <Rs|#n> | Arithmetic Shift Right | N,Z,C
B | label | Branch | -
BFC | Rd, #lsb, #width | Bit Field Clear | -
BFI | Rd, Rn, #lsb, #width | Bit Field Insert | -
BIC, BICS | {Rd,} Rn, Op2 | Bit Clear | N,Z,C
BKPT | #imm | Breakpoint | -
BL | label | Branch with Link | -
BLX | Rm | Branch indirect with Link | -
BX | Rm | Branch indirect | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
CBNZ | Rn, label | Compare and Branch if Non Zero | -
CBZ | Rn, label | Compare and Branch if Zero | -
CLREX | - | Clear Exclusive | -
CLZ | Rd, Rm | Count Leading Zeros | -
CMN | Rn, Op2 | Compare Negative | N,Z,C,V
CMP | Rn, Op2 | Compare | N,Z,C,V
CPSID | i | Change Processor State, Disable Interrupts | -
CPSIE | i | Change Processor State, Enable Interrupts | -
DMB | - | Data Memory Barrier | -
DSB | - | Data Synchronization Barrier | -
EOR, EORS | {Rd,} Rn, Op2 | Exclusive OR | N,Z,C
ISB | - | Instruction Synchronization Barrier | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
IT | - | If-Then condition block | -
LDM | Rn{!}, reglist | Load Multiple registers, increment after | -
LDMDB, LDMEA | Rn{!}, reglist | Load Multiple registers, decrement before | -
LDMFD, LDMIA | Rn{!}, reglist | Load Multiple registers, increment after | -
LDR | Rt, [Rn, #offset] | Load Register with word | -
LDRB, LDRBT | Rt, [Rn, #offset] | Load Register with byte | -
LDRD | Rt, Rt2, [Rn, #offset] | Load Register with two words | -
LDREX | Rt, [Rn, #offset] | Load Register Exclusive | -
LDREXB | Rt, [Rn] | Load Register Exclusive with Byte | -
LDREXH | Rt, [Rn] | Load Register Exclusive with Halfword | -
LDRH, LDRHT | Rt, [Rn, #offset] | Load Register with Halfword | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
LDRSB, LDRSBT | Rt, [Rn, #offset] | Load Register with Signed Byte | -
LDRSH, LDRSHT | Rt, [Rn, #offset] | Load Register with Signed Halfword | -
LDRT | Rt, [Rn, #offset] | Load Register with word | -
LSL, LSLS | Rd, Rm, <Rs|#n> | Logical Shift Left | N,Z,C
LSR, LSRS | Rd, Rm, <Rs|#n> | Logical Shift Right | N,Z,C
MLA | Rd, Rn, Rm, Ra | Multiply with Accumulate, 32-bit result | -
MLS | Rd, Rn, Rm, Ra | Multiply and Subtract, 32-bit result | -
MOV, MOVS | Rd, Op2 | Move | N,Z,C
MOVT | Rd, #imm16 | Move Top | -
MOVW, MOV | Rd, #imm16 | Move 16-bit constant | N,Z,C
MRS | Rd, spec_reg | Move from Special Register to general register | -
MSR | spec_reg, Rm | Move from general register to Special Register | N,Z,C,V
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
MUL, MULS | {Rd,} Rn, Rm | Multiply, 32-bit result | N,Z
MVN, MVNS | Rd, Op2 | Move NOT | N,Z,C
NOP | - | No Operation | -
ORN, ORNS | {Rd,} Rn, Op2 | Logical OR NOT | N,Z,C
ORR, ORRS | {Rd,} Rn, Op2 | Logical OR | N,Z,C
PKHTB, PKHBT | {Rd,} Rn, Rm, Op2 | Pack Halfword | -
POP | reglist | Pop registers from stack | -
PUSH | reglist | Push registers onto stack | -
QADD | {Rd,} Rn, Rm | Saturating Add | Q
QADD16 | {Rd,} Rn, Rm | Saturating Add 16 | -
QADD8 | {Rd,} Rn, Rm | Saturating Add 8 | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
QASX | {Rd,} Rn, Rm | Saturating Add and Subtract with Exchange | -
QDADD | {Rd,} Rn, Rm | Saturating double and Add | Q
QDSUB | {Rd,} Rn, Rm | Saturating double and Subtract | Q
QSAX | {Rd,} Rn, Rm | Saturating Subtract and Add with Exchange | -
QSUB | {Rd,} Rn, Rm | Saturating Subtract | Q
QSUB16 | {Rd,} Rn, Rm | Saturating Subtract 16 | -
QSUB8 | {Rd,} Rn, Rm | Saturating Subtract 8 | -
RBIT | Rd, Rn | Reverse Bits | -
REV | Rd, Rn | Reverse byte order in a word | -
REV16 | Rd, Rn | Reverse byte order in each halfword | -
REVSH | Rd, Rn | Reverse byte order in bottom halfword and sign extend | -
ROR, RORS | Rd, Rm, <Rs|#n> | Rotate Right | N,Z,C
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
RRX, RRXS | Rd, Rm | Rotate Right with Extend | N,Z,C
RSB, RSBS | {Rd,} Rn, Op2 | Reverse Subtract | N,Z,C,V
SADD16 | {Rd,} Rn, Rm | Signed Add 16 | GE
SADD8 | {Rd,} Rn, Rm | Signed Add 8 | GE
SASX | {Rd,} Rn, Rm | Signed Add and Subtract with Exchange | GE
SBC, SBCS | {Rd,} Rn, Op2 | Subtract with Carry | N,Z,C,V
SBFX | Rd, Rn, #lsb, #width | Signed Bit Field Extract | -
SDIV | {Rd,} Rn, Rm | Signed Divide | -
SEV | - | Send Event | -
SHADD16 | {Rd,} Rn, Rm | Signed Halving Add 16 | -
SHADD8 | {Rd,} Rn, Rm | Signed Halving Add 8 | -
SHASX | {Rd,} Rn, Rm | Signed Halving Add and Subtract with Exchange | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
SHSAX | {Rd,} Rn, Rm | Signed Halving Subtract and Add with Exchange | -
SHSUB16 | {Rd,} Rn, Rm | Signed Halving Subtract 16 | -
SHSUB8 | {Rd,} Rn, Rm | Signed Halving Subtract 8 | -
SMLABB, SMLABT, SMLATB, SMLATT | Rd, Rn, Rm, Ra | Signed Multiply Accumulate Long (halfwords) | Q
SMLAD, SMLADX | Rd, Rn, Rm, Ra | Signed Multiply Accumulate Dual | Q
SMLAL | RdLo, RdHi, Rn, Rm | Signed Multiply with Accumulate (32 x 32 + 64), 64-bit result | -
SMLALBB, SMLALBT, SMLALTB, SMLALTT | RdLo, RdHi, Rn, Rm | Signed Multiply Accumulate Long, halfwords | -
SMLALD, SMLALDX | RdLo, RdHi, Rn, Rm | Signed Multiply Accumulate Long Dual | -
SMLAWB, SMLAWT | Rd, Rn, Rm, Ra | Signed Multiply Accumulate, word by halfword | Q
SMLSD | Rd, Rn, Rm, Ra | Signed Multiply Subtract Dual | Q
SMLSLD | RdLo, RdHi, Rn, Rm | Signed Multiply Subtract Long Dual | -
SMMLA | Rd, Rn, Rm, Ra | Signed Most significant word Multiply Accumulate | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
SMMLS, SMMLR | Rd, Rn, Rm, Ra | Signed Most significant word Multiply Subtract | -
SMMUL, SMMULR | {Rd,} Rn, Rm | Signed Most significant word Multiply | -
SMUAD | {Rd,} Rn, Rm | Signed dual Multiply Add | Q
SMULBB, SMULBT, SMULTB, SMULTT | {Rd,} Rn, Rm | Signed Multiply (halfwords) | -
SMULL | RdLo, RdHi, Rn, Rm | Signed Multiply (32 x 32), 64-bit result | -
SMULWB, SMULWT | {Rd,} Rn, Rm | Signed Multiply word by halfword | -
SMUSD, SMUSDX | {Rd,} Rn, Rm | Signed dual Multiply Subtract | -
SSAT | Rd, #n, Rm {,shift #s} | Signed Saturate | Q
SSAT16 | Rd, #n, Rm | Signed Saturate 16 | Q
SSAX | {Rd,} Rn, Rm | Signed Subtract and Add with Exchange | GE
SSUB16 | {Rd,} Rn, Rm | Signed Subtract 16 | -
SSUB8 | {Rd,} Rn, Rm | Signed Subtract 8 | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
STM | Rn{!}, reglist | Store Multiple registers, increment after | -
STMDB, STMEA | Rn{!}, reglist | Store Multiple registers, decrement before | -
STMFD, STMIA | Rn{!}, reglist | Store Multiple registers, increment after | -
STR | Rt, [Rn, #offset] | Store Register word | -
STRB, STRBT | Rt, [Rn, #offset] | Store Register byte | -
STRD | Rt, Rt2, [Rn, #offset] | Store Register two words | -
STREX | Rd, Rt, [Rn, #offset] | Store Register Exclusive | -
STREXB | Rd, Rt, [Rn] | Store Register Exclusive Byte | -
STREXH | Rd, Rt, [Rn] | Store Register Exclusive Halfword | -
STRH, STRHT | Rt, [Rn, #offset] | Store Register Halfword | -
STRT | Rt, [Rn, #offset] | Store Register word | -
SUB, SUBS | {Rd,} Rn, Op2 | Subtract | N,Z,C,V
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
SUB, SUBW | {Rd,} Rn, #imm12 | Subtract | N,Z,C,V
SVC | #imm | Supervisor Call | -
SXTAB | {Rd,} Rn, Rm,{,ROR #} | Extend 8 bits to 32 and add | -
SXTAB16 | {Rd,} Rn, Rm,{,ROR #} | Dual extend 8 bits to 16 and add | -
SXTAH | {Rd,} Rn, Rm,{,ROR #} | Extend 16 bits to 32 and add | -
SXTB16 | {Rd,} Rm {,ROR #n} | Signed Extend Byte 16 | -
SXTB | {Rd,} Rm {,ROR #n} | Sign extend a byte | -
SXTH | {Rd,} Rm {,ROR #n} | Sign extend a halfword | -
TBB | [Rn, Rm] | Table Branch Byte | -
TBH | [Rn, Rm, LSL #1] | Table Branch Halfword | -
TEQ | Rn, Op2 | Test Equivalence | N,Z,C
TST | Rn, Op2 | Test | N,Z,C
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
UADD16 | {Rd,} Rn, Rm | Unsigned Add 16 | GE
UADD8 | {Rd,} Rn, Rm | Unsigned Add 8 | GE
USAX | {Rd,} Rn, Rm | Unsigned Subtract and Add with Exchange | GE
UHADD16 | {Rd,} Rn, Rm | Unsigned Halving Add 16 | -
UHADD8 | {Rd,} Rn, Rm | Unsigned Halving Add 8 | -
UHASX | {Rd,} Rn, Rm | Unsigned Halving Add and Subtract with Exchange | -
UHSAX | {Rd,} Rn, Rm | Unsigned Halving Subtract and Add with Exchange | -
UHSUB16 | {Rd,} Rn, Rm | Unsigned Halving Subtract 16 | -
UHSUB8 | {Rd,} Rn, Rm | Unsigned Halving Subtract 8 | -
UBFX | Rd, Rn, #lsb, #width | Unsigned Bit Field Extract | -
UDIV | {Rd,} Rn, Rm | Unsigned Divide | -
UMAAL | RdLo, RdHi, Rn, Rm | Unsigned Multiply Accumulate Accumulate Long (32 x 32 + 32 + 32), 64-bit result | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
UMLAL | RdLo, RdHi, Rn, Rm | Unsigned Multiply with Accumulate (32 x 32 + 64), 64-bit result | -
UMULL | RdLo, RdHi, Rn, Rm | Unsigned Multiply (32 x 32), 64-bit result | -
UQADD16 | {Rd,} Rn, Rm | Unsigned Saturating Add 16 | -
UQADD8 | {Rd,} Rn, Rm | Unsigned Saturating Add 8 | -
UQASX | {Rd,} Rn, Rm | Unsigned Saturating Add and Subtract with Exchange | -
UQSAX | {Rd,} Rn, Rm | Unsigned Saturating Subtract and Add with Exchange | -
UQSUB16 | {Rd,} Rn, Rm | Unsigned Saturating Subtract 16 | -
UQSUB8 | {Rd,} Rn, Rm | Unsigned Saturating Subtract 8 | -
USAD8 | {Rd,} Rn, Rm | Unsigned Sum of Absolute Differences | -
USADA8 | {Rd,} Rn, Rm, Ra | Unsigned Sum of Absolute Differences and Accumulate | -
USAT | Rd, #n, Rm {,shift #s} | Unsigned Saturate | Q
USAT16 | Rd, #n, Rm | Unsigned Saturate 16 | Q
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
UASX | {Rd,} Rn, Rm | Unsigned Add and Subtract with Exchange | GE
USUB16 | {Rd,} Rn, Rm | Unsigned Subtract 16 | GE
USUB8 | {Rd,} Rn, Rm | Unsigned Subtract 8 | GE
UXTAB | {Rd,} Rn, Rm,{,ROR #} | Rotate, extend 8 bits to 32 and Add | -
UXTAB16 | {Rd,} Rn, Rm,{,ROR #} | Rotate, dual extend 8 bits to 16 and Add | -
UXTAH | {Rd,} Rn, Rm,{,ROR #} | Rotate, unsigned extend and Add Halfword | -
UXTB | {Rd,} Rm {,ROR #n} | Zero extend a Byte | -
UXTB16 | {Rd,} Rm {,ROR #n} | Unsigned Extend Byte 16 | -
UXTH | {Rd,} Rm {,ROR #n} | Zero extend a Halfword | -
VABS.F32 | Sd, Sm | Floating-point Absolute | -
VADD.F32 | {Sd,} Sn, Sm | Floating-point Add | -
VCMP.F32 | Sd, <Sm|#0.0> | Compare two floating-point registers, or one floating-point register and zero | FPSCR
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
VCMPE.F32 | Sd, <Sm|#0.0> | Compare two floating-point registers, or one floating-point register and zero with Invalid Operation check | FPSCR
VCVT.S32.F32 | Sd, Sm | Convert between floating-point and integer | -
VCVT.S16.F32 | Sd, Sd, #fbits | Convert between floating-point and fixed point | -
VCVTR.S32.F32 | Sd, Sm | Convert between floating-point and integer with rounding | -
VCVT<B|H>.F32.F16 | Sd, Sm | Converts half-precision value to single-precision | -
VCVTT<B|T>.F32.F16 | Sd, Sm | Converts single-precision register to half-precision | -
VDIV.F32 | {Sd,} Sn, Sm | Floating-point Divide | -
VFMA.F32 | {Sd,} Sn, Sm | Floating-point Fused Multiply Accumulate | -
VFNMA.F32 | {Sd,} Sn, Sm | Floating-point Fused Negate Multiply Accumulate | -
VFMS.F32 | {Sd,} Sn, Sm | Floating-point Fused Multiply Subtract | -
VFNMS.F32 | {Sd,} Sn, Sm | Floating-point Fused Negate Multiply Subtract | -
VLDM.F<32|64> | Rn{!}, list | Load Multiple extension registers | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
VLDR.F<32|64> | <Dd|Sd>, [Rn] | Load an extension register from memory | -
VMLA.F32 | {Sd,} Sn, Sm | Floating-point Multiply Accumulate | -
VMLS.F32 | {Sd,} Sn, Sm | Floating-point Multiply Subtract | -
VMOV.F32 | Sd, #imm | Floating-point Move immediate | -
VMOV | Sd, Sm | Floating-point Move register | -
VMOV | Sn, Rt | Copy ARM core register to single precision | -
VMOV | Sm, Sm1, Rt, Rt2 | Copy 2 ARM core registers to 2 single precision | -
VMOV | Dd[x], Rt | Copy ARM core register to scalar | -
VMOV | Rt, Dn[x] | Copy scalar to ARM core register | -
VMRS | Rt, FPSCR | Move FPSCR to ARM core register or APSR | N,Z,C,V
VMSR | FPSCR, Rt | Move to FPSCR from ARM Core register | FPSCR
VMUL.F32 | {Sd,} Sn, Sm | Floating-point Multiply | -
Cortex-M4 Instruction Set Tables (cont.)
Mnemonic | Operands | Brief description | Flags
VNEG.F32 | Sd, Sm | Floating-point Negate | -
VNMLA.F32 | Sd, Sn, Sm | Floating-point Multiply and Add | -
VNMLS.F32 | Sd, Sn, Sm | Floating-point Multiply and Subtract | -
VNMUL | {Sd,} Sn, Sm | Floating-point Multiply | -
VPOP | list | Pop extension registers | -
VPUSH | list | Push extension registers | -
VSQRT.F32 | Sd, Sm | Calculates floating-point Square Root | -
VSTM | Rn{!}, list | Floating-point register Store Multiple | -
VSTR.F<32|64> | Sd, [Rn] | Stores an extension register to memory | -
VSUB.F<32|64> | {Sd,} Sn, Sm | Floating-point Subtract | -
WFE | - | Wait For Event | -
WFI | - | Wait For Interrupt | -
Note: full explanation of each instruction can be found in Cortex-M4 Devices’ Generic User Guide (Ref-4)
Cortex-M4 Instruction Set Tables (cont.)
• Cortex-M4 Suffix
– Some instructions can be followed by suffixes to update processor flags or execute the instruction on a certain condition
Suffix | Description | Example | Example explanation
S | Update APSR (flags) | ADDS R1, #0x21 | Add 0x21 to R1 and update APSR
EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE | Condition execution, e.g. EQ = equal, NE = not equal, LT = less than | BNE label | Branch to the label if not equal
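To make the effect of the S suffix concrete, the sketch below emulates in C how a 32-bit ADDS would set the N, Z, C and V flags in the APSR. It is a host-side model for illustration only; the struct and function names are hypothetical and no processor registers are involved.

```c
#include <stdint.h>

/* Result of an emulated ADDS: the 32-bit sum and the four APSR flags. */
typedef struct {
    uint32_t result;
    int n;   /* Negative: bit 31 of the result         */
    int z;   /* Zero: result is zero                    */
    int c;   /* Carry: unsigned overflow out of bit 31  */
    int v;   /* oVerflow: signed overflow               */
} adds_result_t;

adds_result_t emulate_adds(uint32_t a, uint32_t b)
{
    adds_result_t r;
    uint64_t wide = (uint64_t)a + (uint64_t)b;   /* keep the carry out */

    r.result = (uint32_t)wide;
    r.n = (int)((r.result >> 31) & 1u);
    r.z = (r.result == 0u);
    r.c = (int)((wide >> 32) & 1u);
    /* Signed overflow: operands share a sign that differs from the result's. */
    r.v = (int)(((~(a ^ b) & (a ^ r.result)) >> 31) & 1u);
    return r;
}
```

Adding 1 to 0xFFFFFFFF gives a zero result with Z and C set, whereas adding 1 to 0x7FFFFFFF gives 0x80000000 with N and V set. A following conditional instruction such as BNE then tests the Z flag.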
C Calling Assembly
For real-time DSP applications the most common scenario involving assembly code writing, if needed at all, will be C calling assembly. In simple terms the rules are:
• Formally, the ARM Architecture Procedure Call Standard (AAPCS) defines:
– Which registers must be saved and restored
– How to call procedures
– How to return from procedures
AAPCS Register Use Conventions
• Make it easier to create modular, isolated and integrated code
• Scratch registers are not expected to be preserved upon returning from a called subroutine
– This applies to r0–r3
• Preserved (“variable”) registers are expected to have their original values upon returning from a called subroutine
– This applies to r4–r8, r10–r11
– Use PUSH {r4,..} and POP {r4,...}
AAPCS Core Register Use
Register | Synonym | Special | Role in the procedure call standard
r15 | - | PC | The Program Counter.
r14 | - | LR | The Link Register.
r13 | - | SP | The Stack Pointer.
r12 | - | IP | The Intra-Procedure-call scratch register.
r11 | v8 | - | Variable-register 8.
r10 | v7 | - | Variable-register 7.
r9 | - | v6, SB, TR | Platform register. The meaning of this register is defined by the platform standard.
r8 | v5 | - | Variable-register 5.
r7 | v4 | - | Variable register 4.
r6 | v3 | - | Variable register 3.
r5 | v2 | - | Variable register 2.
r4 | v1 | - | Variable register 1.
r3 | a4 | - | Argument / scratch register 4.
r2 | a3 | - | Argument / scratch register 3.
r1 | a2 | - | Argument / result / scratch register 2.
r0 | a1 | - | Argument / result / scratch register 1.
Notes from the figure:
– r4–r8 and r10–r11: must be saved and restored by the callee procedure if it may modify them. The calling subroutine expects these to retain their value.
– r0–r3: don’t need to be saved. May be used for arguments, results, or temporary values.
Example: Vector Norm Squared
In this example we will be computing the squared length of a vector using 16-bit (int16_t) signed numbers. In mathematical terms we are finding
\[ \|\mathbf{A}\|^2 = \sum_{n=1}^{N} A_n^2 \tag{3.1} \]
where
\[ \mathbf{A} = [A_1 \; \cdots \; A_N] \tag{3.2} \]
is an $N$-dimensional vector (column or row vector).
• The solution will be obtained in two different ways:
– Conventional C programming
– Cortex-M assembly
• Optimization is not a concern at this point
• The focus here is to see by way of a simple example, how to call a C routine from C (obvious), and how to call an assembly routine from C
C Version
• We implement this simple routine in C using a declared vec-tor length N and vector contents in the array v
• The C source, which includes the called functionnorm_sq_c is given below:
/******************************************************
Vector norm-squared routine in C and Assembly
******************************************************/
#include "fm4_wm8731_init.h"#include "FM4_slider_interface.h"
// Macros from fm4_wm8731_init.c need to configure GPIO without starting // codec ISRs. PFR is port function setting register - 0 for GPIO, 1 for // peripheral function#define GET_PFR(pin_ofs) ((volatile unsigned char*) (PFR_BASE + pin_ofs))// PCR is port pull-up setting register - 0 for pull-up, 1 for no pull-up#define GET_PCR(pin_ofs) ((volatile unsigned char*) (PCR_BASE + pin_ofs))// PDDR is port direction setting register - 0 for GPIO in, 1 for GPIO out#define GET_DDR(pin_ofs) ((volatile unsigned char*) (DDR_BASE + pin_ofs))
// Create (instantiate) GUI slider data structurestruct FM4_slider_struct FM4_GUI;
// Norm squared in C prototypeint16_t norm_sq_c(int16_t* v, int16_t n);// ASM function prototypesextern uint32_t simple_sqrt(uint32_t x);extern int16_t norm_sq_asm(int16_t *x, int16_t n);
/*---------------------------------------------------------------------------- MAIN function *----------------------------------------------------------------------------*/int main(void){
int16_t x = 0;int16_t v[5] = {1,2,3,6,7};uint32_t zInt = 99;uint32_t zInt_sq;char message[50];
3–40 ECE 5655/4655 Real-Time DSP
Example: Vector Norm Squared

    // Set up DIAGNOSTIC_PIN GPIO for timing of function calls
    // Follow approach used in fm4_wm8731_init(), but without starting ISR
    bFM4_GPIO_ADE_AN00 = 0x00;          // P10 DIAGNOSTIC_PIN
    *GET_PFR(DIAGNOSTIC_PIN) &= ~0u;    // set pin function as GPIO
    *GET_DDR(DIAGNOSTIC_PIN) = 1u;      // set pin direction as output
    *GET_PCR(DIAGNOSTIC_PIN) &= ~0u;    // set pin to have pull-up

    // Initialize the slider interface by setting the baud rate (460800 or
    // 921600) and initial float values for each of the 6 slider parameters
    init_slider_interface(&FM4_GUI,460800, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0);

    // Norm squared experiment
    gpio_set(DIAGNOSTIC_PIN,HIGH);      // pin = P10/A3
    x = norm_sq_c (v, 5);               // call c function
    gpio_set(DIAGNOSTIC_PIN,LOW);
    sprintf(message, "Norm squared C: The answer is %d\n", x);
    write_uart0(message);
    gpio_set(DIAGNOSTIC_PIN,HIGH);      // pin = P10/A3
    x = norm_sq_asm (v, 5);             // call assembly function
    gpio_set(DIAGNOSTIC_PIN,LOW);
    sprintf(message, "Norm squared ASM: The answer is %d\n", x);
    write_uart0(message);

    // uint16_t Square root experiment
    gpio_set(DIAGNOSTIC_PIN,HIGH);      // pin = P10/A3
    zInt_sq = simple_sqrt(zInt);
    gpio_set(DIAGNOSTIC_PIN,LOW);
    sprintf(message, "uint16_t SQRT of %d is %d\n", zInt, zInt_sq);
    write_uart0(message);

    while(1)
    {
        // Update slider parameters
        //update_slider_parameters(&FM4_GUI);
    }
}

int16_t norm_sq_c(int16_t* v, int16_t n)
{
    int16_t i;
    int16_t out = 0;
    for(i=0; i<n; i++)
    {
        out += v[i]*v[i];
    }
    return out;
}
• Notice in this code we have configured three additional output pins for physical code timing
– P1B = D0 is used to time norm_sq_c
– P1C = D1 is used to time norm_sq_asm
• The expected answer is $1 + 4 + 9 + 36 + 49 = 99$
• A comparison of physical code timing and cycle-count timing against the assembly version comes next
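One caveat with the C routine above: both the accumulator and the return value are int16_t, so the result only fits for small vectors (99 fits comfortably, but two elements of value 200 already give 80000). The following is a minimal host-side sketch of a variant with a 32-bit accumulator. The name norm_sq_c32 is hypothetical and not part of the course code, and it assumes the sum of squares fits in an int32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of norm_sq_c with a 32-bit accumulator.
 * Assumption: the sum of squares fits in an int32_t. */
int32_t norm_sq_c32(const int16_t *v, int16_t n)
{
    assert(v != NULL);                  /* caller must pass a valid array */
    int32_t out = 0;
    for (int16_t i = 0; i < n; i++) {
        out += (int32_t)v[i] * v[i];    /* square in 32 bits, then accumulate */
    }
    return out;
}
```

For the test vector {1,2,3,6,7} this returns 99, the same as norm_sq_c; the difference only shows up once the sum exceeds 32767.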
Assembly Version
• The assembly routine is the following:
; File demo_asm.s
        PRESERVE8                       ; Preserve 8 byte stack alignment
        THUMB                           ; indicate THUMB code is used
        AREA |.text|, CODE, READONLY    ; Start of the CODE area
        EXPORT norm_sq_asm

norm_sq_asm FUNCTION
        ; Input array address: R0
        ; Number of elements: R1
        MOVS  R2, R0            ; move the address in R0 to R2
        MOVS  R0, #0            ; initialize the result
sum_loop
        LDRSH R3, [R2], #0x2    ; load int16_t value pointed to
                                ; by R2 into R3, then increment
        MLA   R0, R3, R3, R0    ; sq & accum in one step (faster)
        SUBS  R1, R1, #1        ; R1 = R1 - 1, decrement the count
        CMP   R1, #0            ; compare to 0 and set Z flag
        BNE   sum_loop          ; branch if compare not zero
        BX    LR                ; return R0
        ENDFUNC
        END                     ; End of file
[Figure: terminal output; result $1 + 4 + 9 + 36 + 49 = 99$]
• From just the C source it is not obvious that the function prototype for norm_sq_asm refers to an assembly routine
• The answer is again 99
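To make the register-level steps easier to follow, the assembly loop can be mirrored in C. This is only a host-side sketch: the name norm_sq_model is hypothetical, the local variables are named after the registers in demo_asm.s, and it assumes n >= 1 because the assembly decrements the count before testing it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the norm_sq_asm loop; variables are named
 * after the registers used in demo_asm.s. Assumes n >= 1. */
int16_t norm_sq_model(const int16_t *x, int16_t n)
{
    assert(n >= 1);                 /* the loop body always runs at least once */
    const int16_t *r2 = x;          /* MOVS R2, R0 : copy the array address */
    uint32_t r0 = 0;                /* MOVS R0, #0 : clear the accumulator */
    uint32_t r1 = (uint32_t)n;      /* element count passed in R1 */
    do {
        int32_t r3 = *r2++;         /* LDRSH R3, [R2], #2 : sign-extended load, post-increment */
        r0 += (uint32_t)(r3 * r3);  /* MLA R0, R3, R3, R0 : multiply and accumulate */
        r1 -= 1;                    /* SUBS R1, R1, #1 : decrement the count */
    } while (r1 != 0);              /* BNE sum_loop : repeat until the count is zero */
    return (int16_t)r0;             /* BX LR : prototype declares an int16_t result */
}
```

Because SUBS already updates the condition flags, the model needs only the single loop test; the separate CMP in the listing does not change the outcome.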
Performance Comparison
• In the Keil IDE debugger we set breakpoints around the function to be timed:
• We take measurements using the timers available in the lower right status bar; alternatively, count cycles using the states register:
[Figure: Keil debugger timing. Reset the timer when you arrive at the function call, then step over and read the elapsed time. When debugging, observe the registers and in particular note the change in states before and after the function call is made (right-click to view).]
• To interpret the cycle count as real time, consider the FM4 running with a 200 MHz clock frequency, that is, a 5 ns clock period
Keil debugger timing:
    norm_sq_c with -O0      time = 380 ns
    norm_sq_c with -O3      time = 340 ns
    norm_sq_asm             time = 220 ns
• As a cross check, physical timing is explored using the Analog Discovery logic analyzer:
[Figure: logic analyzer captures for the C version (-O3) and the C calling ASM version. 470 ns (C) vs 350 ns (ASM) implies a ~25.5% reduction in execution time.]
• The physical time results are more consistent, but a bias is introduced since time is required to set and reset the GPIO pin
• The ASM code is timed from Keil by setting the breakpoints around the GPIO pin set functions:
[Figure: Keil debugger timing. Time is 420 ns including pin set and reset, vs 220 ns without, so ~200 ns is due to the pins.]
• The above result is much longer than the physically measured result; we conclude that removing the GPIO bias is not straightforward
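The arithmetic behind these comparisons can be checked with a few helper functions. This is a small host-side sketch; the function names are hypothetical, and the 200 MHz clock and the nanosecond figures are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a given core clock in Hz. */
double cycles_to_ns(uint32_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles * 1.0e9 / clock_hz;
}

/* Convert a time in nanoseconds to (possibly fractional) clock cycles. */
double ns_to_cycles(double time_ns, double clock_hz)
{
    assert(clock_hz > 0.0);
    return time_ns * clock_hz / 1.0e9;
}

/* Percentage reduction in execution time going from t_ref_ns to t_new_ns. */
double percent_reduction(double t_ref_ns, double t_new_ns)
{
    assert(t_ref_ns > 0.0);
    return 100.0 * (t_ref_ns - t_new_ns) / t_ref_ns;
}
```

With the numbers above, ns_to_cycles(220.0, 200.0e6) gives about 44 cycles for norm_sq_asm, percent_reduction(470.0, 350.0) gives about 25.5, and 420 ns - 220 ns = 200 ns is the overhead attributed to setting and resetting the pin.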
Example: Unsigned Integer Square Root¹
• Added to main
    // uint16_t Square root experiment
    gpio_set(PF7, HIGH);    // Pin PF7 = D2
    zInt_sq = simple_sqrt(zInt);
    gpio_set(PF7, LOW);     // Pin PF7 = D2
    sprintf(message, "uint16_t SQRT of %d is %d\n", zInt, zInt_sq);
    write_uart0(message);
    ...
• Added to asm
; in demo_asm.s
        PRESERVE8                       ; Preserve 8 byte stack alignment
        THUMB                           ; indicate THUMB code is used
        AREA |.text|, CODE, READONLY    ; Start of the CODE area
        EXPORT simple_sqrt

simple_sqrt FUNCTION
        ; Input  : R0
        ; Output : R0 (square root result)
        MOVW  R1, #0x8000       ; R1 = 0x00008000
        MOVS  R2, #0            ; Initialize result
simple_sqrt_loop
        ADDS  R2, R2, R1        ; M = (M + N)
        MUL   R3, R2, R2        ; R3 = M^2
        CMP   R3, R0            ; If M^2 > Input
        IT    HI                ; Greater Than
        SUBHI R2, R2, R1        ; M = (M - N)
        LSRS  R1, R1, #1        ; N = N >> 1
        BNE   simple_sqrt_loop
        MOV   R0, R2            ; Copy to R0 and return
        BX    LR                ; Return
        ENDFUNC
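The same bit-by-bit search can be written in C to check the expected outputs on a host machine. This is a sketch only: the name simple_sqrt_model is hypothetical, and the comments map each statement back to the assembly instruction it mirrors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of simple_sqrt: builds the root one bit at a
 * time, from bit 15 down to bit 0. */
uint32_t simple_sqrt_model(uint32_t x)
{
    uint32_t n = 0x8000u;           /* MOVW R1, #0x8000 : trial bit N */
    uint32_t m = 0;                 /* MOVS R2, #0      : result M */
    do {
        m += n;                     /* ADDS R2, R2, R1  : tentatively set the bit */
        if (m * m > x) {            /* MUL, CMP, IT HI  : unsigned compare */
            m -= n;                 /* SUBHI R2, R2, R1 : bit was too large, clear it */
        }
        n >>= 1;                    /* LSRS R1, R1, #1  : move to the next lower bit */
    } while (n != 0);               /* BNE simple_sqrt_loop */
    assert((uint64_t)m * m <= x);   /* result never exceeds the true square root */
    return m;
}
```

Since M never exceeds 0xFFFF, the product M*M always fits in 32 bits, and the loop runs exactly 16 times regardless of the input.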
1. Yiu, Chapter 20, p. 664.
Function Flow Chart
Sample Results
• For an input of 64 the output is 8, as expected
[Figure: Keil debugger timing, execution time = 0.60 µs (~600 ns), with terminal output]
• Physical timing is comparable at 710 ns
• For an input of 99 the output is 9 (81 is the largest perfect square not exceeding 99), as expected
[Figure: terminal output]
Useful Resources
• Architecture Reference Manual: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html
• Cortex-M4 Technical Reference Manual: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439d/DDI0439D_cortex_m4_processor_r0p1_trm.pdf
• Cortex-M4 Devices Generic User Guide: http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/DUI0553A_cortex_m4_dgug.pdf