Intro Spu Optimizations Part 1


Introduction to SPU Optimizations

Part 1: Assembly Instructions

Pal-Kristian Engstad
pal [email protected]

March 5, 2010


Introduction

These slides are used internally at Naughty Dog to introduce new programmers to our SPU programming methods. Due to popular interest, we are now making them public. Note that some of the tools we use have not been released to the public, but there are many alternatives out there that do similar things.

The first set of slides introduces most of the SPU assembly instructions. Please read these carefully before reading the second set. Those slides go through a made-up example showing how one can improve performance drastically by knowing the hardware and by employing a technique called software pipelining.



SPU programming is Cool

In these slides, we will go through all of the assembly instructions that exist on the SPU, giving you a quick introduction to the power of the SPUs.

Each SPU has 256 kB of local memory.

- This local memory can be thought of as 1 cycle memory.
- Programs and data exist in the same local memory space.
- There are no memory protections in local memory!
- The only way to access external memory is through DMA.
- There is a significant delay between when a DMA request is queued and when it finishes.



SPU Execution Environment

The SPU has 128 general purpose 128-bit wide registers. You can think of these as

- 2 doubles (64-bit floating-point values),
- 4 floats (32-bit floating-point values),
- 4 words (32-bit integer values),
- 8 half-words (16-bit integer values), or
- 16 bytes (8-bit integer values).

An SPU can execute an even and an odd instruction each cycle. Even instructions are mostly arithmetic instructions, whereas the odd ones are load/store instructions, shuffles, branches and other special instructions.



Instruction Classes

The instruction set can be divided into classes, where the instructions in the same class have the same parity (i.e. whether they are even or odd) and latency (how long it takes for the result to be ready). In the list below, e or o gives the parity and the number gives the latency in cycles:

    (SP) Single Precision   {e6}
    (FX) FiXed              {e2}
    (WS) Word Shift         {e4}
    (LS) Load/Store         {o6}
    (SH) SHuffle            {o4}
    (FI) Fp Integer         {e7}
    (BO) Byte Operations    {e4}
    (BR) BRanch             {o-}
    (HB) Hint Branch        {o15}
    (CH) CHannel Operations {o6}
    (DP) Double Precision   {e13}



Single Precision Floating Point Class (SP) [Even:6]

The SP class of instructions has a latency of 6 cycles and a throughput of 1 cycle. These are all even instructions.

    fa a, b, c      ; a.f[n] = b.f[n] + c.f[n]
    fs a, b, c      ; a.f[n] = b.f[n] - c.f[n]
    fm a, b, c      ; a.f[n] = b.f[n] * c.f[n]
    fma a, b, c, d  ; a.f[n] = b.f[n] * c.f[n] + d.f[n]
    fms a, b, c, d  ; a.f[n] = b.f[n] * c.f[n] - d.f[n]
    fnms a, b, c, d ; a.f[n] = -(b.f[n] * c.f[n] - d.f[n])

The syntax here indicates that for each of the four 32-bit floating-point values in the register, the operation in the comment is executed.



Single Precision Floating Point Class (SP)

- No broadcast versions.
- No dot-products or cross-products.
- No fnma instruction.

Example: If the registers r1 and r2 contain

    r1 = ( 1.0, 2.0, 3.0, 4.0 ),
    r2 = ( 0.0, -2.0, 1.0, 4.0 ),

then after

    fa r0, r1, r2 ; r0 = r1 + r2

r0 contains

    r0 = ( 1.0, 0.0, 4.0, 8.0 ).
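
The per-lane notation used for the SP class can be mimicked in a few lines of Python. This is only a sketch for checking one's reading of the comments: a register is modelled as a list of four Python floats (which are double precision, unlike the hardware), and the function names simply reuse the mnemonics.

```python
# Hypothetical model: one register = list of four floats (lanes 0..3).

def fa(b, c):
    """fa a, b, c ; a.f[n] = b.f[n] + c.f[n]"""
    return [x + y for x, y in zip(b, c)]

def fma(b, c, d):
    """fma a, b, c, d ; a.f[n] = b.f[n] * c.f[n] + d.f[n]"""
    return [x * y + z for x, y, z in zip(b, c, d)]

def fnms(b, c, d):
    """fnms a, b, c, d ; a.f[n] = -(b.f[n] * c.f[n] - d.f[n])"""
    return [-(x * y - z) for x, y, z in zip(b, c, d)]

r1 = [1.0, 2.0, 3.0, 4.0]
r2 = [0.0, -2.0, 1.0, 4.0]
r0 = fa(r1, r2)
print(r0)  # [1.0, 0.0, 4.0, 8.0]
```

Note that fnms computes d - b * c, which is why it is the natural building block for the "1 - b * x" style terms that show up in reciprocal refinement.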



FiXed precision Class (FX) [Even:2]

The FX class of instructions all have a latency of just 2 cycles and a throughput of 1 cycle. These are even instructions.

There are quite a few of them, and we can further divide them into:

- Integer Arithmetic Operations.
- Immediate Load Operations.
- Comparison Operations.
- Select Bit Operation.
- Logical Bit Operations.
- Extensions and Misc Operations.



FX: Arithmetic Operations & Examples

Notice the subtract-from semantics. This is different from the floating-point subtract (fs) semantics. We think this was mainly due to the additional power of the immediate forms.

    ai i, i, 1    ; i = i + 1, for each word in i
    ahi i, i, -1  ; i = i - 1, for each half-word in i
    sfi i, i, 0   ; i = (-i), for each word in i
    sfhi x, x, 1  ; x = 1 - x, for each half-word in x
    sf z, y, x    ; z = x - y, for each word in z
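
The operand order of subtract-from is easy to get backwards, so here is a small Python sketch of it. The model is an assumption for illustration: a register is a list of four unsigned 32-bit words, and results wrap modulo 2^32.

```python
MASK32 = 0xFFFFFFFF

def sf(y, x):
    """sf z, y, x ; z = x - y (the FIRST source is subtracted FROM the second)."""
    return [(b - a) & MASK32 for a, b in zip(y, x)]

def sfi(j, imm):
    """sfi i, j, imm ; i = imm - j, so an immediate of 0 negates j."""
    return [(imm - a) & MASK32 for a in j]

z = sf([1, 2, 3, 4], [10, 10, 10, 10])
neg = sfi([5, 0, 1, 0xFFFFFFFF], 0)
print(z)                      # [9, 8, 7, 6]
print([hex(w) for w in neg])  # ['0xfffffffb', '0x0', '0xffffffff', '0x1']
```

With the immediate form, "imm - register" gives negation and "1 - x" in a single instruction, which a plain "register - imm" form would not.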



FX: Immediate Loads

The SPU has some instructions that enable us to quickly set up register values. These immediate loads are also 2-cycle FX instructions:

    il i, s16   ; i.w[n] = ext(s16)
    ilh i, u16  ; i.h[n] = u16
    ila i, u18  ; i.w[n] = u18
    ilhu i, u16 ; i.w[n] = u16 << 16


FX: Logical Bit Operations

These instructions work on each of the 128 bits in the registers.

    and i, j, k  ; i = j & k
    nand i, j, k ; i = ~( j & k )
    andc i, j, k ; i = j & ~k
    or i, j, k   ; i = j | k
    nor i, j, k  ; i = ~( j | k )
    orc i, j, k  ; i = j | ~k
    xor i, j, k  ; i = j ^ k
    eqv i, j, k  ; i = ~( j ^ k ), i.e. a bit is set where j and k agree



FX: Logical Operations w/immediates

    andbi i, j, u8  ; i.b[n] = j.b[n] & u8
    andhi i, j, s10 ; i.h[n] = j.h[n] & ext(s10)
    andi i, j, s10  ; i.w[n] = j.w[n] & ext(s10)

    orbi i, j, u8   ; i.b[n] = j.b[n] | u8
    orhi i, j, s10  ; i.h[n] = j.h[n] | ext(s10)
    ori i, j, s10   ; i.w[n] = j.w[n] | ext(s10)

    xorbi i, j, u8  ; i.b[n] = j.b[n] ^ u8
    xorhi i, j, s10 ; i.h[n] = j.h[n] ^ ext(s10)
    xori i, j, s10  ; i.w[n] = j.w[n] ^ ext(s10)



FX: Comparisons (Bytes)

    ceqb i, j, k    ; i.b[n] = (j.b[n] == k.b[n]) ? TRUE : FALSE
    ceqbi i, j, su8 ; i.b[n] = (j.b[n] == su8) ? TRUE : FALSE
    cgtb i, j, k    ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (s)
    cgtbi i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE (s)
    clgtb i, j, k   ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (u)
    clgtbi i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE (u)

    TRUE = 0xFF
    FALSE = 0x00

(s) means signed and (u) means unsigned compares.

FX: Comparisons (Halves)

    ceqh i, j, k    ; i.h[n] = (j.h[n] == k.h[n]) ? TRUE : FALSE
    ceqhi i, j, s10 ; i.h[n] = (j.h[n] == ext(s10)) ? TRUE : FALSE
    cgth i, j, k    ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (s)
    cgthi i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (s)
    clgth i, j, k   ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (u)
    clgthi i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (u)

    TRUE = 0xFFFF
    FALSE = 0x0000

FX: Comparisons (Words)

    ceq i, j, k    ; i.w[n] = (j.w[n] == k.w[n]) ? TRUE : FALSE
    ceqi i, j, s10 ; i.w[n] = (j.w[n] == ext(s10)) ? TRUE : FALSE
    cgt i, j, k    ; i.w[n] = (j.w[n] > k.w[n]) ? TRUE : FALSE (s)
    cgti i, j, s10 ; i.w[n] = (j.w[n] > ext(s10)) ? TRUE : FALSE (s)
    clgt i, j, k   ; i.w[n] = (j.w[n] > k.w[n]) ? TRUE : FALSE (u)
    clgti i, j, s10 ; i.w[n] = (j.w[n] > ext(s10)) ? TRUE : FALSE (u)

    TRUE = 0xFFFF_FFFF
    FALSE = 0x0000_0000

FX: Comparisons (Floats)

    fceq i, b, c  ; i.w[n] = (b.f[n] == c.f[n]) ? TRUE : FALSE
    fcmeq i, b, c ; i.w[n] = (abs(b.f[n]) == abs(c.f[n])) ? TRUE : FALSE
    fcgt i, b, c  ; i.w[n] = (b.f[n] > c.f[n]) ? TRUE : FALSE
    fcmgt i, b, c ; i.w[n] = (abs(b.f[n]) > abs(c.f[n])) ? TRUE : FALSE

    TRUE = 0xFFFF_FFFF
    FALSE = 0x0000_0000

Note: All zeros are equal, e.g.: 0.0 == -0.0.



FX: Select Bits

This very important operation selects bits from j and k depending on the bits in the l register. It fits well with the comparison instructions given previously.

    selb i, j, k, l ; for each bit: i = (l == 0) ? j : k

Notice that if the bit in l is 0, then it selects the bit in j, and if not, then it selects the bit in k.

Example: SIMD min/max

    fcgt mask, a, b      ; mask is all 1s if a > b
    selb max, b, a, mask ; select a if a > b, else b
    selb min, a, b, mask ; select b if a > b, else a
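
The min/max idiom can be traced in Python by working on the raw 32-bit patterns, since selb is a purely bitwise operation. This is a sketch under stated assumptions: registers are lists of four lanes, floats are converted to single-precision bit patterns with the standard struct module, and the helper names f2u and u2f are made up for the example.

```python
import struct

MASK32 = 0xFFFFFFFF

def f2u(x):
    """Bit pattern of x as a 32-bit float (hypothetical helper)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def u2f(u):
    """32-bit pattern back to a Python float (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">I", u))[0]

def fcgt(a, b):
    """fcgt i, a, b ; all ones per lane where a > b, else zero."""
    return [MASK32 if x > y else 0 for x, y in zip(a, b)]

def selb(j, k, l):
    """selb i, j, k, l ; per bit: take j where l is 0, k where l is 1."""
    return [((x & ~m) | (y & m)) & MASK32 for x, y, m in zip(j, k, l)]

a = [1.0, 5.0, -3.0, 2.0]
b = [4.0, 2.0, -1.0, 2.0]
a_bits = [f2u(x) for x in a]
b_bits = [f2u(x) for x in b]

mask = fcgt(a, b)
vmax = [u2f(w) for w in selb(b_bits, a_bits, mask)]
vmin = [u2f(w) for w in selb(a_bits, b_bits, mask)]
print(vmax)  # [4.0, 5.0, -1.0, 2.0]
print(vmin)  # [1.0, 2.0, -3.0, 2.0]
```

All four lanes are resolved without a single branch, which is the point of pairing comparisons with selb.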



FX: Misc

generate borrow bit
    bg i, j, k  ; tmp.w[n] = (-j.w[n] + k.w[n])
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
generate borrow bit with borrow
    bgx i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
generate carry bit
    cg i, j, k  ; i.w[n] = (j.w[n] + k.w[n]) > 0xffffffff ? 1 : 0
generate carry bit with carry
    cgx i, j, k ; tmp.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
                ; i.w[n] = tmp.w[n] > 0xffffffff ? 1 : 0
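
To see what the carry bit is for, here is a Python sketch of a 64-bit addition built from 32-bit pieces. It is an illustration only: add64 is a made-up helper, values are plain Python integers rather than register lanes, and on real hardware the carry would additionally have to be moved into the neighbouring word slot before it can be added in, which this sketch glosses over.

```python
MASK32 = 0xFFFFFFFF

def cg(j, k):
    """cg i, j, k ; 1 if the 32-bit add j + k overflows, else 0."""
    return 1 if (j + k) > MASK32 else 0

def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values given as (high word, low word) pairs."""
    lo = (a_lo + b_lo) & MASK32          # plain 32-bit add of the low words
    carry = cg(a_lo, b_lo)               # carry out of the low words
    hi = (a_hi + b_hi + carry) & MASK32  # high words plus carry in
    return hi, lo

print(add64(0, 0xFFFFFFFF, 0, 1))  # (1, 0)
```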



Word Shift Class (WS) [Even:4]

The WS class of instructions has a latency of 4 cycles and a throughput of 1 cycle. These are all even instructions.

    shlh i, j, k ; i.h[n] = j.h[n] << ( k.h[n] & 0x1f )


Example

    ; Assume r0 = ( 1, 2, 4, 8 )
    ;        r1 = ( 1, 2, 3, 4 )
    shl r2, r0, r1
    ; Now r2 = ( 1<<1, 2<<2, 4<<3, 8<<4 ) = ( 2, 8, 32, 128 )



WS: Rotate left logical

    roth i, j, k ; i.h[n] = rotate_left( j.h[n], k.h[n] & 0x0f )


WS: Shift right logical

    rothm i, j, k    ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
    rothmi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
    rotm i, j, k     ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
    rotmi i, j, imm  ; i.w[n] = j.w[n] >> ( -imm & 0x3f )

Notice here that the shift amounts need to be negative in order to produce a proper shift. This is because this is actually a rotate left followed by a mask operation.
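
The negative shift amount is the part that trips people up, so here is a one-lane Python sketch of rotm that follows the comment above literally. The function is a model for illustration, not a cycle-accurate description; it operates on a single unsigned 32-bit word.

```python
MASK32 = 0xFFFFFFFF

def rotm(j, k):
    """rotm i, j, k ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f ), for one word."""
    count = (-k) & 0x3F            # negate, then keep the low 6 bits
    return (j & MASK32) >> count   # counts of 32 or more leave nothing behind

print(hex(rotm(0x80000000, -4)))  # 0x8000000 : shift right by 4
print(hex(rotm(0x80000000, 4)))   # 0x0       : +4 becomes a count of 60
```

In other words, to shift right by n, pass -n; passing +n silently produces zero for any small positive n.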



WS: Shift right arithmetic

Here >> denotes an arithmetic (sign-extending) shift.

    rotmah i, j, k    ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
    rotmahi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
    rotma i, j, k     ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
    rotmai i, j, imm  ; i.w[n] = j.w[n] >> ( -imm & 0x3f )



Load/Store Class (LS) [Odd:6]

The load/store operations are odd instructions that work on the 256 kB local memory. They have a latency of 6 cycles, but the hardware has short-cuts in place so that you can read a written value immediately after the store. Do note:

- Memory wraps around, so you can never access memory outside the local store (LS).
- You can only load and store a whole quadword, so if you need to modify a part, you need to load the quadword value, merge the modified part into the value, and store the whole quadword back.
- Addresses are in units of bytes, unlike the VUs on the PS2.

The load/store operations use the value in the preferred word of the address register, i.e. the first word.



LS: Loads

    lqa i, label18 ; addr = label18
                   ; range = 256 kB (or +/- 128 kB)
    lqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
                   ; qoff is 10 bit signed, addr range = +/- 8 kB.
    lqr i, label14 ; addr = ext(label14) + pc
                   ; label14 range = +/- 8 kB.
    lqx i, j, k    ; addr = j.w[0] + k.w[0]

LS: Stores

    stqa i, label18 ; addr = label18
                    ; range = 256 kB (or +/- 128 kB)
    stqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
                    ; qoff is 10 bit signed, addr range = +/- 8 kB.
    stqr i, label14 ; addr = ext(label14) + pc
                    ; label14 range = +/- 8 kB.
    stqx i, j, k    ; addr = j.w[0] + k.w[0]


Shuffle Class (SH) [Odd:4]

The shuffle operations all have a 4 cycle latency and they are odd instructions. Most of the instructions in this class deal with the whole quadword.

We can divide the SH class into:

- The Shuffle Bytes Instruction.
- Quadword left-shifts, rotates and right-shifts.
- Creation of Shuffle Masks.
- Form Select Instructions.
- Gather Bit Instructions.
- Reciprocal Estimate Instructions.


SH: Shuffle Bytes

The ordering of bytes, half-words and words within the quadword is shown below. Notice that this is big-endian, not little-endian:

    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    |   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |       0       |       1       |       2       |       3       |
    +---------------+---------------+---------------+---------------+

The shuffle bytes instruction, shufb, takes three inputs: two source registers, r0 and r1, and a shuffle mask msk. The output register d is found by running the following logic for each byte position:


SH: Shuffle Bytes

Let x = msk.b[n], where n goes from 0 to 15:

    if x in 0x00 .. 0x7f:
        if (x & 0x10) == 0x00, then d.b[n] = r0.b[x & 0x0f]
        if (x & 0x10) == 0x10, then d.b[n] = r1.b[x & 0x0f]
    if x in 0x80 .. 0xbf: d.b[n] = 0x00
    if x in 0xc0 .. 0xdf: d.b[n] = 0xff
    if x in 0xe0 .. 0xff: d.b[n] = 0x80

This is very powerful stuff!
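
The rules above translate almost line for line into Python. The sketch below assumes a register is a 16-byte bytes object in the big-endian order shown earlier; shufb here is a model written for this note, not a binding to any real library.

```python
def shufb(r0, r1, msk):
    """Model of shufb d, r0, r1, msk on 16-byte values."""
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x <= 0x7F:
            src = r1 if (x & 0x10) else r0   # bit 0x10 picks the source
            out[n] = src[x & 0x0F]           # low 4 bits pick the byte
        elif x <= 0xBF:
            out[n] = 0x00
        elif x <= 0xDF:
            out[n] = 0xFF
        else:
            out[n] = 0x80
    return bytes(out)

r0 = bytes(range(0x00, 0x10))
r1 = bytes(range(0x10, 0x20))

# Broadcast word 0 of r0 into all four words.
broadcast = shufb(r0, r1, bytes([0x00, 0x01, 0x02, 0x03] * 4))
print(broadcast.hex())  # 00010203000102030001020300010203

# One byte from r1, followed by the three special constants.
special = shufb(r0, r1, bytes([0x1F, 0x80, 0xC0, 0xE0] + [0x00] * 12))
print(special[:4].hex())  # 1f00ff80
```

Because the mask is just data, any permutation, broadcast, zero-fill or merge of two registers costs a single odd-pipe instruction once the mask is in a register.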


SH: Shufb Examples

Previously, we mentioned that the SPU has no broadcast ability, but with a single shufb instruction we can broadcast one word into all words. We can create the shuffle masks using instructions directly, or else we could simply load them using an LS class instruction.

    ila s_AAAA, 0x10203    ; s_AAAA = 0x00_01_02_03 x 4
                           ;        = 0x00010203_00010203_00010203_00010203
    orbi s_CCCC, s_AAAA, 8 ; s_CCCC = 0x08_09_0a_0b x 4

Using these masks, we can quickly create registers with all x's, y's, z's or w's:

    shufb xs, v, v, s_AAAA ; xs = (v.x, v.x, v.x, v.x)
    shufb zs, v, v, s_CCCC ; zs = (v.z, v.z, v.z, v.z)


SH: .dshuf

Because the shuffle instruction is so useful, our frontend tool supports quick creation of shuffle masks. Using the .dshuf directive, we create shuffle masks that follow these rules:

- If the length of the string is 4, we assume word-sized shuffles; if 8, half-word sized shuffles; and if 16, byte-sized shuffles.
- Upper-case letters indicate sources from the first input, lower-case ones indicate sources from the second input.
- 0 indicates zeros, X ones and 8 0x80s.

    .dshuf "ABC0"     ; 0x00010203_04050607_08090a0b_80808080
    .dshuf "aX08"     ; 0x10111213_c0c0c0c0_80808080_e0e0e0e0
    .dshuf "aBC0aBC0" ; 0x1011_0203_0405_8080_1011_0203_0405_8080
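
The .dshuf rules are simple enough to re-implement, which is handy if you do not have the frontend tool. The Python function below is a guess at equivalent behaviour based only on the rules and the three examples above; the real directive may differ in details such as error handling.

```python
def dshuf(pattern):
    """Build a 16-byte shufb mask from a 4-, 8- or 16-character pattern."""
    if len(pattern) not in (4, 8, 16):
        raise ValueError("pattern must have 4, 8 or 16 characters")
    size = 16 // len(pattern)                     # bytes per element
    special = {"0": 0x80, "X": 0xC0, "8": 0xE0}   # zeros, ones, 0x80s
    out = bytearray()
    for ch in pattern:
        if ch in special:                         # checked first: 'X' is not a source
            out += bytes([special[ch]] * size)
            continue
        index = ord(ch.upper()) - ord("A")        # element index within the source
        if not ch.isalpha() or not 0 <= index < len(pattern):
            raise ValueError("bad pattern character: %r" % ch)
        base = index * size + (0x10 if ch.islower() else 0x00)
        out += bytes(range(base, base + size))
    return bytes(out)

print(dshuf("ABC0").hex())      # 000102030405060708090a0b80808080
print(dshuf("aX08").hex())      # 10111213c0c0c0c080808080e0e0e0e0
print(dshuf("aBC0aBC0").hex())  # 10110203040580801011020304058080
```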


SH: Another Shufb Example

We can create finite state-machines, piping input into one end of the quadword, while spitting out the result at another (e.g. the preferred word). Here's an example of such a delay machine:

    ; in the data section:
    m_bcdA: .dshuf "bcdA"
    ; in the init section:
    lqa s_bcdA, m_bcdA
    ; in the loop:
    shufb state, input, state, s_bcdA ; state.x = state.y
                                      ; state.y = state.z
                                      ; state.z = state.w
                                      ; state.w = input.x
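
To watch the delay machine run, here is a self-contained Python simulation. Everything in it is a model for illustration: shufb is re-implemented only for mask bytes below 0x80, the "bcdA" mask is written out by hand, and registers are 16-byte big-endian values.

```python
import struct

def shufb(r0, r1, msk):
    """Minimal shufb model: source selection only (mask bytes < 0x80)."""
    both = r0 + r1                       # r0 is bytes 0x00..0x0f, r1 is 0x10..0x1f
    return bytes(both[x & 0x1F] for x in msk)

# Mask "bcdA": words y, z, w of the second input, then word x of the first.
S_BCDA = bytes(range(0x14, 0x20)) + bytes(range(0x00, 0x04))

state = bytes(16)                        # state starts as (0, 0, 0, 0)
outputs = []
for value in [1, 2, 3, 4, 5]:
    inp = struct.pack(">4I", value, 0, 0, 0)   # value sits in input.x
    state = shufb(inp, state, S_BCDA)          # shift state, append input.x
    outputs.append(struct.unpack(">4I", state)[0])

print(outputs)                       # [0, 0, 0, 1, 2]
print(struct.unpack(">4I", state))   # (2, 3, 4, 5)
```

Each input value shows up in the preferred word three iterations after it was pushed in, and the whole queue update is one instruction per iteration.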


SH: Quadword Shift Left

These instructions take the preferred byte (byte 3) or an immediate value, shifting the whole quadword to the left. There are versions that shift by a number of bytes as well as by a number of bits. For bit shifts, the shift amount is masked to be less than 8.

SHift Left Quadword by BYtes
    shlqby i, j, k ; i = j << ((k.b[3] & 0x1f) * 8)



SH: Quadword Rotate Left

These follow the same pattern as left shifts:

ROTate (left) Quadword by BYtes
    rotqby i, j, k ; i = rotate_left( j, (k.b[3] & 0x0f) * 8 )


SH: Quadword Shift Right

Ditto for right shifts, though as for the WS class, we call them rotates with mask and use negative shift amounts:

ROTate and Mask Quadword by BYtes
    rotmqby i, j, k     ; i = j >> ((-k.b[3] & 0x1f) * 8)
ROTate and Mask Quadword by BYtes Immediate
    rotmqbyi i, j, imm  ; i = j >> ((-imm & 0x1f) * 8)
ROTate and Mask Quadword by BYtes using BIt count
    rotmqbybi i, j, k   ; i = j >> (-(k.b[3] & 0xf8) & 0xf8) (*)
ROTate and Mask Quadword by BIts
    rotmqbi i, j, k     ; i = j >> (-(k.b[3] & 0x07))
ROTate and Mask Quadword by BIts Immediate
    rotmqbii i, j, imm  ; i = j >> (-imm & 0x07)




SH: Gather Bits Instructions

These are the opposite of the form select instructions, and can be used to quickly pack results from comparison operators into compact bytes or half-words. They all gather the rightmost bit from each element of the source register and pack it into a single bit in the target.

Gather Bits from Bytes
    gbb i, j ; i = 0; for (n = 0; n < 16; n++) { i.w[0] <<= 1; i.w[0] |= j.b[n] & 1; }


SH: How to generate masks for non-quadword stores.

As seen in the section for load/store, there are no non-quadword load/store operations. A way to store a non-quadword value is to load the destination quadword, shuffle the value with the loaded quadword, and store it back to the same location. In order to simplify the process of generating these shuffle masks, there are a few instructions that generate these control masks:

Generate Controls for Byte Insertion (d-form)
    cbd i, imm(j)
Generate Controls for Byte Insertion (x-form)
    cbx i, j, k


SH: How to generate masks for non-quadword stores.

Generate Controls for Halfword Insertion (d-form)
    chd i, imm(j)
Generate Controls for Halfword Insertion (x-form)
    chx i, j, k
Generate Controls for Word Insertion (d-form)
    cwd i, imm(j)
Generate Controls for Word Insertion (x-form)
    cwx i, j, k
Generate Controls for Doubleword Insertion (d-form)
    cdd i, imm(j)
Generate Controls for Doubleword Insertion (x-form)
    cdx i, j, k


SH: How to generate masks for non-quadword stores.

Example: Store preferred byte into a table

    lqx qword, table, offset
    cbx mask, table, offset
    shufb qword, value, qword, mask
    stqx qword, table, offset
    ai offset, offset, 1
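
The load / generate-controls / shuffle / store sequence can be simulated in Python to see why it leaves the neighbouring bytes alone. The sketch rests on a few assumptions that are not spelled out on the slide: cbx is modelled as producing an identity mask for the second shufb input with 0x03 (the preferred byte of the first input) at position address & 0xf, loads and stores ignore the low four address bits, and the local store is a plain bytearray.

```python
def shufb(r0, r1, msk):
    """Minimal shufb model: source selection only (mask bytes < 0x80)."""
    both = r0 + r1
    return bytes(both[x & 0x1F] for x in msk)

def cbx(addr):
    """Assumed model of cbx: keep the old quadword, insert byte 3 at addr & 0xf."""
    msk = bytearray(range(0x10, 0x20))   # identity mask for the second input
    msk[addr & 0xF] = 0x03               # preferred byte of the first input
    return bytes(msk)

def store_byte(ls, addr, value_reg):
    """Store the preferred byte of value_reg at byte address addr."""
    base = addr & ~0xF                         # quadword containing addr
    qword = bytes(ls[base:base + 16])          # lqx qword, table, offset
    mask = cbx(addr)                           # cbx mask, table, offset
    qword = shufb(value_reg, qword, mask)      # shufb qword, value, qword, mask
    ls[base:base + 16] = qword                 # stqx qword, table, offset

local_store = bytearray(b"\x11" * 32)
value = bytes([0, 0, 0, 0xAB] + [0] * 12)      # 0xAB in the preferred byte
store_byte(local_store, 21, value)
print(local_store[16:32].hex())  # 1111111111ab11111111111111111111
```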


SH: Reciprocal Estimate Instructions

The hardware supports two fast (4 cycle) estimate instructions, frest and frsqest, that calculate the reciprocal recip( x ) = 1 / x, or the reciprocal square root rsqrt( x ) = 1 / sqrt( x ). These instructions work in conjunction with the fi instruction, which we'll explain in more detail later. After the interpolation instruction, results are accurate to a precision of 12 bits, which is about half the floating-point precision of 23 bits. In order to improve the accuracy, one must perform another Taylor- or Euler-step.

Do note that:

    sqrt( x ) = sqrt( x ) * sqrt( x ) / sqrt( x ) = |x| * ( 1 / sqrt( x ) ) = x * rsqrt( x ),

since x >= 0, so there is no need for a separate square-root function.


Improving precision on the reciprocal function

Assuming we have the input in the x register and the constant 1.0 in the one register, we proceed to calculate

    frest a, x
    fi b, x, a        ; b is good to 12 bits precision
    fnms c, b, x, one ; c = 1 - b * x
    fma b, c, b, b    ; b = b + c * b, good to 24 bits precision
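
The effect of the fnms/fma pair can be checked numerically. In this Python sketch the hardware estimate is replaced by a stand-in: estimate_recip returns the true reciprocal with a deliberate relative error of 2^-12, which is an assumption chosen to mimic "good to 12 bits", and the arithmetic runs in double precision rather than single.

```python
def estimate_recip(x, rel_err=2.0 ** -12):
    """Stand-in for frest + fi: 1/x with roughly 12 good bits (assumed error)."""
    return (1.0 / x) * (1.0 + rel_err)

def refine_recip(x, b):
    """One refinement step, mirroring the two instructions above."""
    c = -(b * x - 1.0)   # fnms c, b, x, one ; c = 1 - b * x
    return c * b + b     # fma b, c, b, b    ; b = b + c * b

x = 3.0
b0 = estimate_recip(x)
b1 = refine_recip(x, b0)
print(abs(b0 * x - 1.0))  # about 2.4e-4 (12 bits)
print(abs(b1 * x - 1.0))  # about 6e-8  (24 bits)
```

If b = (1 + e) / x, then c = -e and the refined value is (1 - e * e) / x, so each step squares the relative error. That is why a single step takes 12 bits to 24.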


Improving precision on the reciprocal square-root function

    frsqest a, x
    fi b, x, a        ; b is good to 12 bits precision
    fm c, b, x        ; (b and a can share register)
    fm d, b, onehalf  ; (c and x can share register)
    fnms c, c, b, one
    fma b, d, c, b    ; b is good to 24 bits precision
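
The same kind of numerical check works for the reciprocal square root. As before, this Python sketch substitutes an assumed 12-bit-accurate estimate for frsqest + fi and runs in double precision; the intermediate names follow the register names in the listing.

```python
import math

def estimate_rsqrt(x, rel_err=2.0 ** -12):
    """Stand-in for frsqest + fi: 1/sqrt(x) with roughly 12 good bits (assumed)."""
    return (1.0 / math.sqrt(x)) * (1.0 + rel_err)

def refine_rsqrt(x, b):
    """One refinement step, mirroring the four instructions above."""
    c = b * x            # fm c, b, x
    d = b * 0.5          # fm d, b, onehalf
    c = -(c * b - 1.0)   # fnms c, c, b, one ; c = 1 - x * b * b
    return d * c + b     # fma b, d, c, b    ; b = b + (b / 2) * c

x = 2.0
b0 = estimate_rsqrt(x)
b1 = refine_rsqrt(x, b0)
print(abs(b1 * math.sqrt(x) - 1.0))  # about 9e-8
print(x * b1, math.sqrt(x))          # sqrt(x) recovered as x * rsqrt(x)
```

The last line also exercises the identity sqrt( x ) = x * rsqrt( x ): one extra fm turns the refined reciprocal square root into a square root.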


SH: Or Across - The Final Instruction

The last instruction in the SH class is a new addition.

Or Across
    orx i, j ; i.w[0] = ( j.w[0] | j.w[1] | j.w[2] | j.w[3] )
             ; i.w[1] = i.w[2] = i.w[3] = 0


Floating point / Integer Class (FI) [Even:7]

The FI class of instructions has a latency of 7 cycles and a throughput of 1 cycle. These are all even instructions. There are basically three types of instructions: integer multiplies, interpolations for reciprocal calculations, and finally, fp/integer conversions.


FI: Integer Multiplies

multiply lower halves signed
    mpy i, j, k      ; i.w[n] = j.h[2n+1] * k.h[2n+1]
multiply lower halves signed immediate
    mpyi i, j, s10   ; i.w[n] = j.h[2n+1] * ext(s10)
multiply lower halves unsigned
    mpyu i, j, k     ; i.w[n] = j.h[2n+1] * k.h[2n+1]
multiply lower halves unsigned immediate (immediate sign-extends)
    mpyui i, j, s10  ; i.w[n] = j.h[2n+1] * ext(s10)
multiply lower halves, add word
    mpya i, j, k, l  ; i.w[n] = j.h[2n+1] * k.h[2n+1] + l.w[n]
multiply lower halves, shift result down 16 with sign extend
    mpys i, j, k     ; i.w[n] = ( j.h[2n+1] * k.h[2n+1] ) >> 16
multiply upper half j by lower half k, shift up 16
    mpyh i, j, k     ; i.w[n] = ( j.h[2n] * k.h[2n+1] ) << 16
multiply upper halves signed
    mpyhh i, j, k    ; i.w[n] = j.h[2n] * k.h[2n]
multiply upper halves unsigned
    mpyhhu i, j, k   ; i.w[n] = j.h[2n] * k.h[2n]
multiply/accumulate upper halves
    mpyhha i, j, k   ; i.w[n] += j.h[2n] * k.h[2n]
multiply/accumulate upper halves unsigned
    mpyhhau i, j, k  ; i.w[n] += j.h[2n] * k.h[2n]
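
All of these multiply 16-bit halves, so a full 32-bit multiply has to be assembled from several of them. The slides do not show that sequence; the Python sketch below illustrates one common decomposition (two mpyh cross terms plus one mpyu) on a single word. The half-word products are modelled as unsigned, which does not change the low 32 bits of the result.

```python
MASK32 = 0xFFFFFFFF

def hi(w):
    """Upper half-word of a 32-bit word, i.e. h[2n] in the slide notation."""
    return (w >> 16) & 0xFFFF

def lo(w):
    """Lower half-word of a 32-bit word, i.e. h[2n+1] in the slide notation."""
    return w & 0xFFFF

def mpyh(j, k):
    """mpyh i, j, k ; ( upper half of j * lower half of k ) << 16."""
    return ((hi(j) * lo(k)) << 16) & MASK32

def mpyu(j, k):
    """mpyu i, j, k ; lower half of j * lower half of k, unsigned."""
    return lo(j) * lo(k)

def mul32(a, b):
    """Low 32 bits of a * b from three 16-bit multiplies and two adds."""
    return (mpyh(a, b) + mpyh(b, a) + mpyu(a, b)) & MASK32

a, b = 0x12345678, 0x9ABCDEF0
print(hex(mul32(a, b)), hex((a * b) & MASK32))  # both values agree
```

The term hi(a) * hi(b) would be shifted up by 32 bits and therefore never reaches the low word, which is why three multiplies are enough.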


FI: Conversions and FI instruction

    fi a, b, c          ; use after frest or frsqest

    cuflt a, j, precis  ; unsigned int to float
    csflt a, j, precis  ; signed int to float
    cfltu i, b, precis  ; float to unsigned int
    cflts i, b, precis  ; float to signed int

Here precis is the precision as an immediate, so that e.g.

    cuflt fp, val, 8    ; converts 0x80 into 0.5

Also, please note that these instructions saturate to the min and max values of their precision.
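
The precis operand amounts to a fixed-point scale of 2^precis. Here is a Python sketch of the unsigned pair, under the assumptions that conversion to integer truncates toward zero and that out-of-range results saturate to 0 and 0xFFFFFFFF; single-precision rounding is ignored.

```python
MAX_U32 = 0xFFFFFFFF

def cuflt(j, precis):
    """cuflt a, j, precis ; unsigned int to float, divided by 2**precis."""
    return float(j & MAX_U32) / (1 << precis)

def cfltu(b, precis):
    """cfltu i, b, precis ; float times 2**precis to unsigned int, saturating."""
    scaled = int(b * (1 << precis))       # assumed: truncate toward zero
    return min(max(scaled, 0), MAX_U32)   # clamp to the unsigned 32-bit range

print(cuflt(0x80, 8))       # 0.5
print(hex(cfltu(0.5, 8)))   # 0x80
print(cfltu(-3.0, 0))       # 0 (saturated)
print(hex(cfltu(1e20, 0)))  # 0xffffffff (saturated)
```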


Byte Operations (BO) [Even:4]

There are a couple of interesting instructions that help with multi-media and streaming logic.

Count Ones in Bytes
    cntb i, j     ; i.b[n] = numOneBits( j.b[n] )
Average Bytes
    avgb i, j, k  ; i.b[n] = ( j.b[n] + k.b[n] + 1 ) / 2
Absolute Difference in Bytes
    absdb i, j, k ; i.b[n] = abs( j.b[n] - k.b[n] )
Sum Bytes into Half-words
    sumb i, j, k  ; i.h[0] = k.b[0] + k.b[1] + k.b[2] + k.b[3];
                  ; i.h[1] = j.b[0] + j.b[1] + j.b[2] + j.b[3];
                  ; :
                  ; i.h[6] = k.b[12] + k.b[13] + k.b[14] + k.b[15];
                  ; i.h[7] = j.b[12] + j.b[13] + j.b[14] + j.b[15];


Branch Class (BR) [Odd:-]

Branches on the SPU are costly. If a branch is taken and it has not been predicted, there is an 18 cycle penalty so that the chip can restart the pipe. There is no penalty for falling through a non-predicted branch. However, if you have predicted a branch and it is not taken, then there is also an 18 cycle penalty. Branches and branch hints are all odd instructions.

Note: Even a static branch needs to be predicted.

Note: This is one of the reasons why diverging control-paths are so difficult to optimize for.


BR: Unconditional Branches

Branch Relative
    br brTo       ; goto label address
Branch Relative and Set Link
    brsl i, brTo  ; gosub label address, i.w[0] = return address (*)
Branch Indirect
    bi i          ; goto i.w[0]
Branch Indirect and Set Link
    bisl i, j     ; gosub j.w[0], i.w[0] = return address (*)
BRanch Absolute
    bra brTo      ; goto brTo
BRanch Absolute and Set Link
    brasl i, brTo ; gosub label address, i.w[0] = return address (*)

(*): These instructions have a 4 cycle latency for the return register.

Note: The bi instructions have enable/disable interrupt versions, e.g.: bie, bid, bisle, bisld.


BR: Conditional Branches (Relative)

Branch on Zero
    brz i, brTo   ; branch if i.w[0] == 0
Branch on Not Zero
    brnz i, brTo  ; branch if i.w[0] != 0
Branch on Zero Half-word
    brhz i, brTo  ; branch if i.h[1] == 0
Branch on Not Zero Half-word
    brhnz i, brTo ; branch if i.h[1] != 0

BR: Conditional Branches (Indirect)

Branch Indirect on Zero
    biz i, j   ; branch to j.w[0] if i.w[0] == 0
Branch Indirect on Not Zero
    binz i, j  ; branch to j.w[0] if i.w[0] != 0
Branch Indirect on Zero Half-word
    bihz i, j  ; branch to j.w[0] if i.h[1] == 0
Branch Indirect on Not Zero Half-word
    bihnz i, j ; branch to j.w[0] if i.h[1] != 0

Note: These instructions can enable/disable interrupts as well.


BR: Interrupt & Misc

Interrupt RETurn
    iret i      ; Return from interrupt
Interrupt RETurn, Disable interrupts
    iretd i     ; Return from interrupt, disable interrupts
Interrupt RETurn, Enable interrupts
    irete i     ; Return from interrupt, enable interrupts
Branch Indirect and Set Link if External Data
    bisled i, j ; gosub j if channel 0 is non-zero


Hints Branch Class (HB) [Odd:15]

If you know the most likely (or only) outcome for a branch, you can make sure the branch is penalty free as long as the hint occurs at least 15 cycles before the branch is taken. If the hint occurs later, there still may be a benefit, since the penalty is lowered. However, if the hint arrives less than 4 cycles before the branch, there is no benefit.

Please note that there is a hardware bug w.r.t. the hbr instructions. One cannot hint a branch where the branch targets forwards and the target is also within the same 64-byte block as the branch.

Hints Branch Instructions

Hint Branch (Immediate)
    hbr brFrom, j     ; branch hint for any BIxxx type branch
Hint Branch Absolute
    hbra brFrom, brTo ; branch hint for any BRAxxx type branch
Hint Branch Relative
    hbrr brFrom, brTo ; branch hint for any BRxxx type branch
Hint Branch Prefetch
    hbrp              ; inline prefetch code (*)

(*) allows 15 LS instructions in a row without any instruction fetch stall.



DP: Double Precision

DP instructions have a latency of 13 cycles and are even. However, they stall the pipeline for 6 cycles (that is, all currently executing instructions are halted) while the instruction is executed. Therefore, we do not recommend using double precision at all!


Questions?

That's all folks!

