Intro Spu Optimizations Part 1

    Introduction to SPU Optimizations

    Part 1: Assembly Instructions

    Pal-Kristian Engstad

    March 5, 2010

    Introduction

    These slides are used internally at Naughty Dog to introduce new programmers to our SPU programming methods. Due to popular interest, we are now making them public. Note that some of the tools we use have not been released to the public, but there exist many alternatives that do similar things.

    The first set of slides introduces most of the SPU assembly instructions. Please read these carefully before reading the second set. Those slides go through a made-up example showing how one can improve performance drastically by knowing the hardware and by employing a technique called software pipelining.

    SPU programming is Cool

    In these slides, we will go through all of the assembly instructions that exist on the SPU, giving you a quick introduction to the power of the SPUs.

    Each SPU has 256 kB of local memory.

    - This local memory can be thought of as 1-cycle memory.
    - Programs and data exist in the same local memory space.
    - There are no memory protections in local memory!
    - The only way to access external memory is through DMA.
    - There is a significant delay between when a DMA request is queued and when it finishes.

    SPU Execution Environment

    The SPU has 128 general purpose 128-bit wide registers. You can think of these as

    - 2 doubles (64-bit floating-point values),
    - 4 floats (32-bit floating-point values),
    - 4 words (32-bit integer values),
    - 8 half-words (16-bit integer values), or
    - 16 bytes (8-bit integer values).

    An SPU executes an even and an odd instruction each cycle. Even instructions are mostly arithmetic instructions, whereas the odd ones are load/store instructions, shuffles, branches and other special instructions.
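
The different views of one register can be sketched in Python. This is only a model for illustration, not SPU code: the register is assumed to be 16 raw bytes, and the helper names (`as_words` and friends) are made up here. It uses big-endian ordering, as the slides state later for the SPU.

```python
import struct

# A hypothetical 128-bit register value, written as 16 bytes (byte 0 first).
reg = bytes(range(16))

def as_bytes(q):
    return list(q)                          # 16 x 8-bit

def as_halfwords(q):
    return list(struct.unpack(">8H", q))    # 8 x 16-bit, big-endian

def as_words(q):
    return list(struct.unpack(">4I", q))    # 4 x 32-bit, big-endian

def as_floats(q):
    return list(struct.unpack(">4f", q))    # 4 x 32-bit floating point

def as_doubles(q):
    return list(struct.unpack(">2d", q))    # 2 x 64-bit floating point

print([hex(w) for w in as_words(reg)])
```

The same 16 bytes are reinterpreted in each case; no data is converted.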

    Instruction Classes

    The instruction set can be divided into classes, where the instructions in the same class have the same parity (i.e. whether they are even or odd) and latency (how long it takes for the result to be ready):

    (SP) Single Precision    {e6}
    (FX) FiXed               {e2}
    (WS) Word Shift          {e4}
    (LS) Load/Store          {o6}
    (SH) SHuffle             {o4}
    (FI) Fp Integer          {e7}
    (BO) Byte Operations     {e4}
    (BR) BRanch              {o-}
    (HB) Hint Branch         {o15}
    (CH) CHannel Operations  {o6}
    (DP) Double Precision    {e13}

    Single Precision Floating Point Class (SP) [Even:6]

    The SP class of instructions have a latency of 6 cycles and a throughput of 1 cycle. These are all even instructions.

    fa   a, b, c    ; a.f[n] = b.f[n] + c.f[n]
    fs   a, b, c    ; a.f[n] = b.f[n] - c.f[n]
    fm   a, b, c    ; a.f[n] = b.f[n] * c.f[n]
    fma  a, b, c, d ; a.f[n] = b.f[n] * c.f[n] + d.f[n]
    fms  a, b, c, d ; a.f[n] = b.f[n] * c.f[n] - d.f[n]
    fnms a, b, c, d ; a.f[n] = -(b.f[n] * c.f[n] - d.f[n])

    The syntax here indicates that for each of the 4 32-bit floating point values in the register, the operation in the comment is executed.

    Single Precision Floating Point Class (SP)

    - No broadcast versions.
    - No dot-products or cross-products.
    - No fnma instruction.

    Example: If the registers r1 and r2 contain

    r1 = ( 1.0, 2.0, 3.0, 4.0 ),
    r2 = ( 0.0, -2.0, 1.0, 4.0 ),

    then after

    fa r0, r1, r2 ; r0 = r1 + r2

    r0 contains

    r0 = ( 1.0, 0.0, 4.0, 8.0 ).
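
The lane-wise behaviour can be modelled in a few lines of Python. This is a sketch only: registers are represented as 4-tuples, the function names simply mirror the mnemonics, and Python floats are doubles, so rounding will differ from real single-precision hardware.

```python
# Each function applies the operation independently to the 4 lanes.
def fa(b, c):
    return tuple(x + y for x, y in zip(b, c))

def fm(b, c):
    return tuple(x * y for x, y in zip(b, c))

def fma(b, c, d):
    return tuple(x * y + z for x, y, z in zip(b, c, d))

def fnms(b, c, d):
    return tuple(-(x * y - z) for x, y, z in zip(b, c, d))

r1 = (1.0, 2.0, 3.0, 4.0)
r2 = (0.0, -2.0, 1.0, 4.0)
r0 = fa(r1, r2)
print(r0)
```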

    FiXed precision Class (FX) [Even:2]

    The FX class of instructions all have a latency of just 2 cycles and a throughput of 1 cycle. These are even instructions.

    There are quite a few of them, and we can further divide them into:

    - Integer Arithmetic Operations.
    - Immediate Load Operations.
    - Comparison Operations.
    - Select Bit Operation.
    - Logical Bit Operations.
    - Extensions and Misc Operations.

    FX: Arithmetic Operations & Examples

    Notice the "subtract from" semantics. This is different from the floating point subtract (fs) semantics. We think this was mainly due to the additional power of the immediate forms.

    ai   i, i, 1  ; i = i + 1, for each word in i
    ahi  i, i, -1 ; i = i - 1, for each half-word in i
    sfi  i, i, 0  ; i = (-i), for each word in i
    sfhi x, x, 1  ; x = 1 - x, for each half-word in x
    sf   z, y, x  ; z = x - y, for each word in x and y
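
The operand order of "subtract from" is easy to get wrong, so here is a small Python model of the word forms. It is a sketch under assumptions: registers are 4-tuples of unsigned 32-bit words, and results wrap modulo 2^32.

```python
MASK32 = 0xFFFFFFFF

def ai(x, imm):
    # ai z, x, imm : z = x + imm
    return tuple((a + imm) & MASK32 for a in x)

def sf(y, x):
    # sf z, y, x : z = x - y (the FIRST source is subtracted FROM the second)
    return tuple((b - a) & MASK32 for a, b in zip(y, x))

def sfi(x, imm):
    # sfi z, x, imm : z = imm - x
    return tuple((imm - a) & MASK32 for a in x)

print(sf((1, 2, 3, 4), (10, 10, 10, 10)))   # x - y per word
print([hex(w) for w in sfi((5, 0, 1, 2), 0)])  # negation via 0 - x
```

The immediate form explains the design: `sfi` with 0 negates a register, and with any other constant computes `constant - x`, which a plain "subtract immediate" could not express.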

    FX: Immediate Loads

    The SPU has some instructions that enable us to quickly set up register values. These immediate loads are also 2-cycle FX instructions:

    il   i, s16 ; i.w[n] = ext(s16)
    ilh  i, u16 ; i.h[n] = u16
    ila  i, u18 ; i.w[n] = u18
    ilhu i, u16 ; i.w[n] = u16 << 16

    FX: Logical Bit Operations

    These instructions work on each of the 128 bits in the registers.

    and  i, j, k ; i = j & k
    nand i, j, k ; i = ~(j & k)
    andc i, j, k ; i = j & ~k
    or   i, j, k ; i = j | k
    nor  i, j, k ; i = ~(j | k)
    orc  i, j, k ; i = j | ~k
    xor  i, j, k ; i = j ^ k
    eqv  i, j, k ; i = ~(j ^ k), i.e. bitwise j == k

    FX: Logical Operations w/immediates

    andbi i, j, u8  ; i.b[n] = j.b[n] & u8
    andhi i, j, s10 ; i.h[n] = j.h[n] & ext(s10)
    andi  i, j, s10 ; i.w[n] = j.w[n] & ext(s10)

    orbi  i, j, u8  ; i.b[n] = j.b[n] | u8
    orhi  i, j, s10 ; i.h[n] = j.h[n] | ext(s10)
    ori   i, j, s10 ; i.w[n] = j.w[n] | ext(s10)

    xorbi i, j, u8  ; i.b[n] = j.b[n] ^ u8
    xorhi i, j, s10 ; i.h[n] = j.h[n] ^ ext(s10)
    xori  i, j, s10 ; i.w[n] = j.w[n] ^ ext(s10)

    FX: Comparisons (Bytes)

    ceqb   i, j, k   ; i.b[n] = (j.b[n] == k.b[n]) ? TRUE : FALSE
    ceqbi  i, j, su8 ; i.b[n] = (j.b[n] == su8) ? TRUE : FALSE
    cgtb   i, j, k   ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (s)
    cgtbi  i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE (s)
    clgtb  i, j, k   ; i.b[n] = (j.b[n] > k.b[n]) ? TRUE : FALSE (u)
    clgtbi i, j, su8 ; i.b[n] = (j.b[n] > su8) ? TRUE : FALSE (u)

    TRUE = 0xFF
    FALSE = 0x00

    (s) means signed and (u) means unsigned compares.

    FX: Comparisons (Halves)

    ceqh   i, j, k   ; i.h[n] = (j.h[n] == k.h[n]) ? TRUE : FALSE
    ceqhi  i, j, s10 ; i.h[n] = (j.h[n] == ext(s10)) ? TRUE : FALSE
    cgth   i, j, k   ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (s)
    cgthi  i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (s)
    clgth  i, j, k   ; i.h[n] = (j.h[n] > k.h[n]) ? TRUE : FALSE (u)
    clgthi i, j, s10 ; i.h[n] = (j.h[n] > ext(s10)) ? TRUE : FALSE (u)

    TRUE = 0xFFFF
    FALSE = 0x0000

    FX: Comparisons (Words)

    ceq   i, j, k   ; i.w[n] = (j.w[n] == k.w[n]) ? TRUE : FALSE
    ceqi  i, j, s10 ; i.w[n] = (j.w[n] == ext(s10)) ? TRUE : FALSE
    cgt   i, j, k   ; i.w[n] = (j.w[n] > k.w[n]) ? TRUE : FALSE (s)
    cgti  i, j, s10 ; i.w[n] = (j.w[n] > ext(s10)) ? TRUE : FALSE (s)
    clgt  i, j, k   ; i.w[n] = (j.w[n] > k.w[n]) ? TRUE : FALSE (u)
    clgti i, j, s10 ; i.w[n] = (j.w[n] > ext(s10)) ? TRUE : FALSE (u)

    TRUE = 0xFFFF_FFFF
    FALSE = 0x0000_0000

    FX: Comparisons (Floats)

    fceq  i, b, c ; i.w[n] = (b[n] == c[n]) ? TRUE : FALSE
    fcmeq i, b, c ; i.w[n] = (abs(b[n]) == abs(c[n])) ? TRUE : FALSE
    fcgt  i, b, c ; i.w[n] = (b[n] > c[n]) ? TRUE : FALSE
    fcmgt i, b, c ; i.w[n] = (abs(b[n]) > abs(c[n])) ? TRUE : FALSE

    TRUE = 0xFFFF_FFFF
    FALSE = 0x0000_0000

    Note: All zeros are equal, e.g.: 0.0 == -0.0.

    FX: Select Bits

    This very important operation selects bits from j and k depending on the bits in the l register. It fits well with the comparison instructions given previously.

    selb i, j, k, l ; i = (l == 0) ? j : k, evaluated per bit

    Notice that if the bit in l is 0, then it selects the bit in j, and if not, then it selects the bit in k.

    Example: SIMD min/max

    fcgt mask, a, b      ; mask is all 1s if a > b
    selb max, b, a, mask ; select a if a > b, else b
    selb min, a, b, mask ; select b if a > b, else a
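
The min/max idiom can be checked with a Python model of `selb` and `fcgt`. This is a sketch, not SPU code: registers are assumed to be 4-tuples of 32-bit words holding IEEE single-precision bit patterns, and the helpers `f2w` and `w2f` are invented here to convert between floats and their bit patterns.

```python
import struct

MASK32 = 0xFFFFFFFF

def f2w(f):
    # float -> 32-bit pattern
    return struct.unpack(">I", struct.pack(">f", f))[0]

def w2f(w):
    # 32-bit pattern -> float
    return struct.unpack(">f", struct.pack(">I", w))[0]

def selb(j, k, l):
    # Per bit: take the bit from k where l is 1, from j where l is 0.
    return tuple((a & ~m & MASK32) | (b & m) for a, b, m in zip(j, k, l))

def fcgt(a, b):
    # All ones in a lane where a > b, all zeros otherwise.
    return tuple(MASK32 if w2f(x) > w2f(y) else 0 for x, y in zip(a, b))

a = tuple(f2w(v) for v in (1.0, 5.0, -3.0, 2.0))
b = tuple(f2w(v) for v in (2.0, 4.0, -1.0, 2.0))

mask = fcgt(a, b)
vmax = tuple(w2f(w) for w in selb(b, a, mask))
vmin = tuple(w2f(w) for w in selb(a, b, mask))
print(vmax, vmin)
```

No branch is involved: one compare and two selects handle all four lanes.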

    FX: Misc

    generate borrow bit
    bg  i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n])
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
    generate borrow bit with borrow
    bgx i, j, k ; tmp.w[n] = (-j.w[n] + k.w[n] + (i.w[n] & 1) - 1)
                ; i.w[n] = tmp.w[n] < 0 ? 0 : 1
    generate carry bit
    cg  i, j, k ; i.w[n] = (j.w[n] + k.w[n]) > 0xffffffff ? 1 : 0
    generate carry bit with carry
    cgx i, j, k ; tmp.w[n] = (j.w[n] + k.w[n] + (i.w[n] & 1))
                ; i.w[n] = tmp.w[n] > 0xffffffff ? 1 : 0

    Word Shift Class (WS) [Even:4]

    The WS class of instructions have a latency of 4 cycles and a throughput of 1 cycle. These are all even instructions.

    shlh i, j, k ; i.h[n] = j.h[n] << ( k.h[n] & 0x1f )

    Example

    ; Assume r0 = ( 1, 2, 4, 8 )
    ;        r1 = ( 1, 2, 3, 4 )
    shl r2, r0, r1
    ; Now    r2 = ( 1 << 1, 2 << 2, 4 << 3, 8 << 4 ) = ( 2, 8, 32, 128 )

    WS: Rotate left logical

    roth i, j, k ; i.h[n] = j.h[n] rotated left by ( k.h[n] & 0x0f ) bits

    WS: Shift right logical

    rothm  i, j, k   ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
    rothmi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
    rotm   i, j, k   ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
    rotmi  i, j, imm ; i.w[n] = j.w[n] >> ( -imm & 0x3f )

    Notice here that the shift amounts need to be negative in order to produce a proper shift. This is because this is actually a rotate left and then mask operation.
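
The negative shift amount is a common source of bugs, so here is a Python model of the word immediate form. It is a sketch under assumptions: one unsigned 32-bit lane, with the count taken as the negated immediate masked to 6 bits, and counts of 32 or more producing zero.

```python
MASK32 = 0xFFFFFFFF

def rotmi(word, imm):
    # Logical shift right by (-imm & 0x3f); counts of 32..63 clear the word.
    count = (-imm) & 0x3F
    if count >= 32:
        return 0
    return (word & MASK32) >> count

print(hex(rotmi(0x80000000, -4)))  # shift right by 4
print(hex(rotmi(0x80000000, 4)))   # positive amount: count is 60, result is 0
```

Passing a positive amount does not shift right by that amount; it produces a large count and clears the word.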

    WS: Shift right arithmetic

    rotmah  i, j, k   ; i.h[n] = j.h[n] >> ( -k.h[n] & 0x1f )
    rotmahi i, j, imm ; i.h[n] = j.h[n] >> ( -imm & 0x1f )
    rotma   i, j, k   ; i.w[n] = j.w[n] >> ( -k.w[n] & 0x3f )
    rotmai  i, j, imm ; i.w[n] = j.w[n] >> ( -imm & 0x3f )

    Load/Store Class (LS) [Odd:6]

    The load/store operations are odd instructions that work on the 256 kB local memory. They have a latency of 6 cycles, but the hardware has short-cuts in place so that you can read a written value immediately after the store. Do note:

    - Memory wraps around, so you can never access memory outside the local store (LS).
    - You can only load and store a whole quadword, so if you need to modify a part, you need to load the quadword value, merge the modified part into the value and store the whole quadword back.
    - Addresses are in units of bytes, unlike the VUs on the PS2.

    The load/store operations will use the value in the preferred word of the address register, i.e. the first word.
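
These rules can be sketched as a small Python model of the local store. It is an illustration only: the names `local_store`, `load_quad`, `store_quad` and `store_byte` are made up, and the model assumes that addresses wrap at 256 kB and that the low 4 address bits are ignored, so that every access touches one aligned quadword.

```python
LS_SIZE = 256 * 1024
local_store = bytearray(LS_SIZE)

def effective_address(addr):
    # Wrap into the local store and force 16-byte alignment.
    return addr & (LS_SIZE - 1) & ~0xF

def load_quad(addr):
    ea = effective_address(addr)
    return bytes(local_store[ea:ea + 16])

def store_quad(addr, quad):
    assert len(quad) == 16
    ea = effective_address(addr)
    local_store[ea:ea + 16] = quad

def store_byte(addr, value):
    # Read-modify-write: there is no narrower store than a quadword.
    quad = bytearray(load_quad(addr))
    quad[addr & 0xF] = value & 0xFF
    store_quad(addr, bytes(quad))

store_byte(0x1005, 0xAB)
print(load_quad(0x1000).hex())
```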

    LS: Loads

    lqa i, label18 ; addr = label18
                   ; range = 256kb (or +/- 128kb)
    lqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
                   ; qoff is 10 bit signed, addr range = +/-8kb.
    lqr i, label14 ; addr = ext(label14) + pc
                   ; label14 range = +/- 8kb.
    lqx i, j, k    ; addr = j.w[0] + k.w[0]

    LS: Stores

    stqa i, label18 ; addr = label18
                    ; range = 256kb (or +/- 128kb)
    stqd i, qoff(j) ; addr = qoff * 16 + j.w[0]
                    ; qoff is 10 bit signed, addr range = +/-8kb.
    stqr i, label14 ; addr = ext(label14) + pc
                    ; label14 range = +/- 8kb.
    stqx i, j, k    ; addr = j.w[0] + k.w[0]

    Shuffle Class (SH) [Odd:4]

    The shuffle operations all have 4 cycle latency and they are odd instructions. Most of the instructions in this class deal with the whole quadword.

    We can divide the SH class into:

    - The Shuffle Bytes Instruction.
    - Quadword left-shifts, rotates and right-shifts.
    - Creation of Shuffle Masks.
    - Form Select Instructions.
    - Gather Bit Instructions.
    - Reciprocal Estimate Instructions.

    SH: Shuffle Bytes

    The ordering of bytes, half-words and words within the quadword is shown below. Notice that this is big-endian, not little-endian:

    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    |   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |       0       |       1       |       2       |       3       |
    +---------------+---------------+---------------+---------------+

    The shuffle byte instruction shufb takes three inputs: two source registers r0 and r1, and a shuffle mask msk. The output register d is found by running the following logic on each byte within the input registers:

    SH: Shuffle Bytes

    Let x = msk.b[n], where n goes from 0 to 15:

    if x in 0x00 .. 0x7f:
        if (x & 0x10) == 0x00, then d.b[n] = r0.b[x & 0x0f].
        if (x & 0x10) == 0x10, then d.b[n] = r1.b[x & 0x0f].
    if x in 0x80 .. 0xbf: d.b[n] = 0x00
    if x in 0xc0 .. 0xdf: d.b[n] = 0xff
    if x in 0xe0 .. 0xff: d.b[n] = 0x80

    This is very powerful stuff!
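
The rules above translate almost line for line into Python. The following is a model for experimenting with masks, not SPU code; registers are assumed to be 16-byte `bytes` objects, and the test vector `v` is arbitrary.

```python
def shufb(r0, r1, msk):
    # Bytes 0x00..0x0f of the 32-byte source come from r0, 0x10..0x1f from r1.
    src = r0 + r1
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x < 0x80:
            out[n] = src[x & 0x1F]
        elif x < 0xC0:
            out[n] = 0x00
        elif x < 0xE0:
            out[n] = 0xFF
        else:
            out[n] = 0x80
    return bytes(out)

v = bytes(range(0x40, 0x50))              # words x, y, z, w
s_AAAA = bytes([0x00, 0x01, 0x02, 0x03] * 4)
s_CCCC = bytes(b | 0x08 for b in s_AAAA)  # 0x08_09_0a_0b x 4

xs = shufb(v, v, s_AAAA)
zs = shufb(v, v, s_CCCC)
print(xs.hex(), zs.hex())
```

With a mask that repeats the byte indices of one word, the result is that word broadcast to all four lanes.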

    SH: Shufb Examples

    Previously, we mentioned that the SPU has no broadcast ability, but with a single shufb instruction we can broadcast one word into all words. We can create the shuffle masks using instructions directly, or else we could simply load them using an LS class instruction.

    ila  s_AAAA, 0x10203   ; s_AAAA = 0x00_01_02_03 x 4
                           ;        = 0x00010203_00010203_00010203_00010203
    orbi s_CCCC, s_AAAA, 8 ; s_CCCC = 0x08_09_0a_0b x 4

    Using these masks, we can quickly create registers with all x's, y's, z's or w's:

    shufb xs, v, v, s_AAAA ; xs = (v.x, v.x, v.x, v.x)
    shufb zs, v, v, s_CCCC ; zs = (v.z, v.z, v.z, v.z)

    SH: .dshuf

    Because the shuffle instruction is so useful, our frontend tool supports quick creation of shuffle masks. Using the .dshuf directive, we create shuffle masks that follow these rules:

    - If the length of the string is 4, we assume word-sized shuffles; if 8, half-word sized; and if 16, byte-sized shuffles.
    - Upper-case letters indicate sources from the first input; lower-case ones indicate sources from the second input.
    - 0 indicates zeros, X ones and 8 0x80s.

    .dshuf "ABC0"     ; 0x00010203_04050607_08090a0b_80808080
    .dshuf "aX08"     ; 0x10111213_c0c0c0c0_80808080_e0e0e0e0
    .dshuf "aBC0aBC0" ; 0x1011_0203_0405_8080_1011_0203_0405_8080
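
Since the .dshuf tool itself is not public, the rules can be approximated with a short Python function. This is a guess at the behaviour based only on the rules and the three examples above; the function name `dshuf` and its error handling are assumptions.

```python
def dshuf(pattern):
    # Build a 16-byte shufb mask from a 4-, 8- or 16-character pattern.
    if len(pattern) not in (4, 8, 16):
        raise ValueError("pattern must have 4, 8 or 16 characters")
    size = 16 // len(pattern)            # bytes per element
    special = {"0": 0x80, "X": 0xC0, "8": 0xE0}
    out = bytearray()
    for ch in pattern:
        if ch in special:
            out += bytes([special[ch]] * size)
            continue
        slot = ord(ch.lower()) - ord("a")
        if not 0 <= slot < len(pattern):
            raise ValueError("bad element letter: " + ch)
        # Lower case selects the second input (mask bytes 0x10..0x1f).
        base = slot * size + (0x10 if ch.islower() else 0x00)
        out += bytes(range(base, base + size))
    return bytes(out)

print(dshuf("ABC0").hex())
print(dshuf("aX08").hex())
print(dshuf("aBC0aBC0").hex())
```

The special characters are checked before the letters, because `X` would otherwise be read as an element name.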

    SH: Another Shufb Example

    We can create finite state machines, piping input into one end of the quadword while spitting out the result at another (e.g. the preferred word). Here's an example of such a delay machine:

    ; in the data section:
    m_bcdA: .dshuf "bcdA"
    ; in the init section:
    lqa s_bcdA, 0(m_bcdA)
    ; in the loop:
    shufb state, input, state, s_bcdA ; state.x = state.y
                                      ; state.y = state.z
                                      ; state.z = state.w
                                      ; state.w = input.x

    SH: Quadword Shift Left

    These instructions take the preferred byte (byte 3) or an immediate value, shifting the whole quadword to the left. There are versions that shift by a number of bytes as well as by a number of bits. For bit shifts, the shift amount is clamped to be less than 8.

    SHift Left Quadword by BYtes
    shlqby i, j, k ; i = j << ((k.b[3] & 0x1f) * 8)

    SH: Quadword Rotate Left

    These follow the same pattern as left shifts:

    ROTate (left) Quadword by BYtes
    rotqby i, j, k ; i = j rotated left by (k.b[3] & 0x0f) bytes

    SH: Quadword Shift Right

    Ditto for right shifts, though as for the WS class, we call them rotates with mask and use negative shift amounts:

    ROTate and Mask Quadword by BYtes
    rotmqby   i, j, k   ; i = j >> ((-k.b[3] & 0x1f) * 8)
    ROTate and Mask Quadword by BYtes Immediate
    rotmqbyi  i, j, imm ; i = j >> ((-imm & 0x1f) * 8)
    ROTate and Mask Quadword by BYtes using BIt count
    rotmqbybi i, j, k   ; i = j >> (-(k.b[3] & 0xf8) & 0xf8) (*)
    ROTate and Mask Quadword by BIts
    rotmqbi   i, j, k   ; i = j >> (-(k.b[3] & 0x07))
    ROTate and Mask Quadword by BIts Immediate
    rotmqbii  i, j, imm ; i = j >> (-imm & 0x07)

    SH: Gather Bits Instructions

    These are the opposite of the form select instructions, and can be used to quickly pack results from comparison operators into compact bytes or half-words. They all gather the rightmost bit from each element of the source register and pack it into a single bit in the target.

    Gather Bits from Bytes
    gbb i, j ; i.w[0] = the rightmost bit of each of the 16 bytes of j, packed into 16 bits
             ; i.w[1] = i.w[2] = i.w[3] = 0

    SH: How to generate masks for non-quadword stores.

    As seen in the section on load/store, there are no non-quadword load/store operations. A way to store a non-quadword value is to load the destination quadword, shuffle the value with the loaded quadword, and store it back to the same location. To make it easier to produce the required shuffle masks, there are a few instructions that generate these control masks:

    Generate Controls for Byte Insertion (d-form)
    cbd i, imm(j)
    Generate Controls for Byte Insertion (x-form)
    cbx i, j, k

    SH: How to generate masks for non-quadword stores.

    Generate Controls for Halfword Insertion (d-form)
    chd i, imm(j)
    Generate Controls for Halfword Insertion (x-form)
    chx i, j, k
    Generate Controls for Word Insertion (d-form)
    cwd i, imm(j)
    Generate Controls for Word Insertion (x-form)
    cwx i, j, k
    Generate Controls for Doubleword Insertion (d-form)
    cdd i, imm(j)
    Generate Controls for Doubleword Insertion (x-form)
    cdx i, j, k

    SH: How to generate masks for non-quadword stores.

    Example: Store preferred byte into a table

    lqx   qword, table, offset
    cbx   mask, table, offset
    shufb qword, value, qword, mask
    stqx  qword, table, offset
    ai    offset, offset, 1
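
The insertion sequence above can be traced with a Python model. This is a sketch: `cbx` here takes the already-summed address rather than two registers, it assumes the generated mask is the identity mask for the second shufb source (0x10..0x1f) with 0x03 placed at the target byte position, and the data values are arbitrary.

```python
def shufb(r0, r1, msk):
    # Same per-byte rules as the shufb description in these slides.
    src = r0 + r1
    out = bytearray(16)
    for n, x in enumerate(msk):
        if x < 0x80:
            out[n] = src[x & 0x1F]
        elif x < 0xC0:
            out[n] = 0x00
        elif x < 0xE0:
            out[n] = 0xFF
        else:
            out[n] = 0x80
    return bytes(out)

def cbx(addr):
    # Keep every byte of the second source, except at the target position,
    # which takes byte 3 (the preferred byte) of the first source.
    mask = bytearray(range(0x10, 0x20))
    mask[addr & 0xF] = 0x03
    return bytes(mask)

qword = bytes(range(0xA0, 0xB0))            # quadword loaded from the table
value = bytes([0, 0, 0, 0x7E] + [0] * 12)   # value in the preferred byte
mask = cbx(0x1235)                          # table + offset
merged = shufb(value, qword, mask)
print(merged.hex())
```

Only byte 5 of the quadword changes; the other 15 bytes are written back unmodified.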

    SH: Reciprocal Estimate Instructions

    The hardware supports two fast (4 cycles) instructions that calculate estimates of the reciprocal recip(x) = 1/x and the reciprocal square root rsqrt(x) = 1/sqrt(x). These instructions work in conjunction with the fi instruction that we'll later explain in detail. After the interpolation instruction, results are accurate to a precision of 12 bits, which is about half the floating-point precision of 23 bits. In order to improve the accuracy, one must perform another Taylor- or Euler-step.

    Do note that:

    \[ \mathrm{sqrt}(x) = \sqrt{x} = \frac{\sqrt{x}\,\sqrt{x}}{\sqrt{x}} = |x| \cdot \frac{1}{\sqrt{x}} = x \cdot \mathrm{rsqrt}(x), \]

    since x >= 0, so there is no need for a separate square-root function.

    Improving precision on the reciprocal function

    Assuming we have the input in the x register, we proceed to calculate

    frest a, x
    fi    b, x, a      ; b is good to 12 bits precision
    fnms  c, b, x, one ; c = 1 - b * x
    fma   b, c, b, b   ; b is good to 24 bits precision
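
The effect of the refinement step can be checked numerically in Python. This is a sketch under assumptions: `estimate_12bit` is an invented stand-in for the frest/fi pair (it simply rounds the true reciprocal to about 12 bits), and Python doubles are used instead of single precision.

```python
import math

def estimate_12bit(value):
    # Stand-in for frest + fi: keep roughly 12 bits of mantissa.
    m, e = math.frexp(value)
    return math.ldexp(round(m * 4096) / 4096, e)

def refine_recip(x, b):
    c = -(b * x - 1.0)   # fnms c, b, x, one : c = 1 - b * x
    return c * b + b     # fma  b, c, b, b   : b = b + b * (1 - b * x)

x = 3.0
b0 = estimate_12bit(1.0 / x)
b1 = refine_recip(x, b0)
print(abs(b0 * x - 1.0), abs(b1 * x - 1.0))
```

If the estimate has relative error e, the refined value has relative error e squared, which is why one step roughly doubles the number of good bits.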

    Improving precision on the reciprocal square-root function

    frsqest a, x
    fi      b, x, a       ; b is good to 12 bits precision
    fm      c, b, x       ; (b and a can share register)
    fm      d, b, onehalf ; (c and x can share register)
    fnms    c, c, b, one
    fma     b, d, c, b    ; b is good to 24 bits precision
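
The same kind of numerical check works for the reciprocal square root. Again a sketch under assumptions: `estimate_12bit` is an invented stand-in for frsqest followed by fi, and the arithmetic is done in Python doubles.

```python
import math

def estimate_12bit(value):
    # Stand-in for frsqest + fi: keep roughly 12 bits of mantissa.
    m, e = math.frexp(value)
    return math.ldexp(round(m * 4096) / 4096, e)

def refine_rsqrt(x, b):
    c = b * x            # fm   c, b, x
    d = b * 0.5          # fm   d, b, onehalf
    c = -(c * b - 1.0)   # fnms c, c, b, one : c = 1 - x * b * b
    return d * c + b     # fma  b, d, c, b   : b = b + (b / 2) * (1 - x * b * b)

x = 2.0
b0 = estimate_12bit(1.0 / math.sqrt(x))
b1 = refine_rsqrt(x, b0)
print(b1, x * b1)        # rsqrt(x) and sqrt(x) = x * rsqrt(x)
```

Multiplying the refined value by x gives the square root, so no separate square-root routine is needed.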

    SH: Or Across - The Final Instruction

    The last instruction in the SH class is a new addition.

    Or Across
    orx i, j ; i.w[0] = ( j.w[0] | j.w[1] | j.w[2] | j.w[3] );
             ; i.w[1] = i.w[2] = i.w[3] = 0

    Floating point / Integer Class (FI) [Even:7]

    The FI class of instructions have a latency of 7 cycles and a throughput of 1 cycle. These are all even instructions. There are basically three types of instructions: integer multiplies, interpolations for reciprocal calculations, and finally, fp/integer conversions.

    FI: Integer Multiplies

    multiply lower halves signed
    mpy   i, j, k   ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves signed immediate
    mpyi  i, j, s10 ; i.w[n] = j.h[2n+1] * ext(s10)
    multiply lower halves unsigned
    mpyu  i, j, k   ; i.w[n] = j.h[2n+1] * k.h[2n+1]
    multiply lower halves unsigned immediate (immediate sign-extends)
    mpyui i, j, s10 ; i.w[n] = j.h[2n+1] * ext(s10)

    FI: Integer Multiplies

    multiply lower halves, add word
    mpya i, j, k, l ; i.w[n] = j.h[2n+1] * k.h[2n+1] + l.w[n]
    multiply lower halves, shift result down 16 with sign extend
    mpys i, j, k    ; i.w[n] = (j.h[2n+1] * k.h[2n+1]) >> 16
    multiply upper half j by lower half k, shift up 16
    mpyh i, j, k    ; i.w[n] = (j.h[2n] * k.h[2n+1]) << 16
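
Because all multiplies are 16 x 16 bits, a full 32-bit multiply has to be assembled from partial products. The Python sketch below models one word lane and shows one common way to combine mpyh and mpyu; it is an illustration, not taken from the slides. The halves are treated as unsigned here, which gives the same low 32 bits as the hardware after the shift by 16.

```python
MASK32 = 0xFFFFFFFF

def hi(w):
    return (w >> 16) & 0xFFFF    # upper half-word, h[2n]

def lo(w):
    return w & 0xFFFF            # lower half-word, h[2n+1]

def mpyu(j, k):
    # lower halves, unsigned 16 x 16 -> 32
    return lo(j) * lo(k)

def mpyh(j, k):
    # upper half of j times lower half of k, shifted up 16
    return ((hi(j) * lo(k)) << 16) & MASK32

def mul32(a, b):
    # a * b mod 2^32 = (a_hi * b_lo << 16) + (b_hi * a_lo << 16) + a_lo * b_lo
    return (mpyh(a, b) + mpyh(b, a) + mpyu(a, b)) & MASK32

print(hex(mul32(0x12345678, 0x9ABCDEF0)))
```

The term a_hi * b_hi is never needed, since it only affects bits 32 and above.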

    multiply upper halves signed
    mpyhh   i, j, k ; i.w[n] = j.h[2n] * k.h[2n]
    multiply upper halves unsigned
    mpyhhu  i, j, k ; i.w[n] = j.h[2n] * k.h[2n]
    multiply/accumulate upper halves
    mpyhha  i, j, k ; i.w[n] += j.h[2n] * k.h[2n]
    multiply/accumulate upper halves unsigned
    mpyhhau i, j, k ; i.w[n] += j.h[2n] * k.h[2n]

    FI: Conversions and FI instruction

    fi    a, b, c      ; use after frest or frsqest

    cuflt a, j, precis ; unsigned int to float
    csflt a, j, precis ; signed int to float
    cfltu i, b, precis ; float to unsigned int
    cflts i, b, precis ; float to signed int

    Here precis is the precision as an immediate, so that e.g.

    cuflt fp, val, 8 ; converts 0x80 into 0.5

    Also, please note that these instructions saturate to the min and max values of their precision.
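
The role of the precision operand can be illustrated with a Python model of the unsigned conversions. This is a sketch under assumptions: one 32-bit lane, `precis` treated as the number of fractional bits, truncation toward zero on the way back to an integer, and Python doubles instead of single precision.

```python
def cuflt(val, precis):
    # unsigned int -> float, dividing by 2^precis
    return (val & 0xFFFFFFFF) / float(1 << precis)

def cfltu(f, precis):
    # float -> unsigned int, multiplying by 2^precis and saturating
    scaled = int(f * (1 << precis))
    return max(0, min(0xFFFFFFFF, scaled))

print(cuflt(0x80, 8))           # 0x80 with 8 fractional bits
print(hex(cfltu(0.5, 8)))       # and back again
print(hex(cfltu(1e20, 0)))      # too large: saturates
```

This makes the instructions a natural fit for fixed-point data, since the scaling comes for free with the conversion.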

    Byte Operations (BO) [Even: 4]

    There are a couple of interesting instructions that help with multimedia and streaming logic.

    Count Ones in Bytes
    cntb  i, j    ; i.b[n] = numOneBits( j.b[n] )
    Average Bytes
    avgb  i, j, k ; i.b[n] = ( j.b[n] + k.b[n] + 1 ) / 2
    Absolute Difference in Bytes
    absdb i, j, k ; i.b[n] = abs( j.b[n] - k.b[n] )
    Sum Bytes into Half-words
    sumb  i, j, k ; i.h[0] = k.b[0] + k.b[1] + k.b[2] + k.b[3];
                  ; i.h[1] = j.b[0] + j.b[1] + j.b[2] + j.b[3];
                  ; :
                  ; i.h[6] = k.b[12] + k.b[13] + k.b[14] + k.b[15];
                  ; i.h[7] = j.b[12] + j.b[13] + j.b[14] + j.b[15];
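
A Python model makes the interleaving in sumb easier to see. It is a sketch: registers are 16-byte `bytes` objects, bytes are assumed to be unsigned, and sumb returns a list of 8 half-word values.

```python
def cntb(j):
    return bytes(bin(b).count("1") for b in j)

def avgb(j, k):
    return bytes((a + b + 1) // 2 for a, b in zip(j, k))

def absdb(j, k):
    return bytes(abs(a - b) for a, b in zip(j, k))

def sumb(j, k):
    # Even half-words hold sums from k, odd half-words sums from j.
    out = []
    for w in range(4):
        out.append(sum(k[4 * w:4 * w + 4]))
        out.append(sum(j[4 * w:4 * w + 4]))
    return out

j = bytes(range(16))
k = bytes([255] * 16)
print(sumb(j, k))
```

Combining absdb with sumb gives a sum of absolute differences over groups of 4 bytes, a typical building block in image and video code.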

    Branch Class (BR) [Odd:-]

    Branches on the SPU are costly. If a branch is taken and it has not been predicted, there is an 18 cycle penalty so that the chip can restart the pipe. There is no penalty for falling through a non-predicted branch. However, if you have predicted a branch and it is not taken, then there is also an 18 cycle penalty. Branches and branch hints are all odd instructions.

    Note: Even a static branch needs to be predicted.

    Note: This is one of the reasons why diverging control-paths are so difficult to optimize for.
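
To get a feel for these numbers, the expected cost per branch can be written as a tiny Python function. This is a back-of-the-envelope sketch: it assumes the hint is issued early enough to be fully effective, and the function name and the probability parameter are made up for illustration.

```python
PENALTY = 18  # cycles, as stated above

def expected_penalty(p_taken, hinted_taken):
    # hinted_taken: a hint predicts the branch as taken.
    if hinted_taken:
        return (1.0 - p_taken) * PENALTY   # pay when it falls through
    return p_taken * PENALTY               # pay when it is taken

# A loop branch that is taken 90% of the time:
print(expected_penalty(0.9, hinted_taken=False))
print(expected_penalty(0.9, hinted_taken=True))
```

For a branch that is taken half of the time, both choices cost about 9 cycles on average, which is why diverging control paths are hard to make fast.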

    BR: Unconditional Branches

    Branch Relative
    br    brTo    ; goto label address
    Branch Relative and Set Link
    brsl  i, brTo ; gosub label address, i.w[0] = return address (*)
    Branch Indirect
    bi    i       ; goto i.w[0]
    Branch Indirect and Set Link
    bisl  i, j    ; gosub j.w[0], i.w[0] = return address (*)
    BRanch Absolute
    bra   brTo    ; goto brTo
    BRanch Absolute and Set Link
    brasl i, brTo ; gosub label address, i.w[0] = return address (*)

    (*): These instructions have a 4 cycle latency for the return register.

    Note: The bi instructions have enable/disable interrupt versions, e.g. bie, bid, bisle, bisld.

    BR: Conditional Branches (Relative)

    Branch on Zero
    brz   i, brTo ; branch if i.w[0] == 0
    Branch on Not Zero
    brnz  i, brTo ; branch if i.w[0] != 0
    Branch on Zero (Half-word)
    brhz  i, brTo ; branch if i.h[1] == 0
    Branch on Not Zero (Half-word)
    brhnz i, brTo ; branch if i.h[1] != 0

    BR: Conditional Branches (Indirect)

    Branch Indirect on Zero
    biz   i, j ; branch to j.w[0] if i.w[0] == 0
    Branch Indirect on Not Zero
    binz  i, j ; branch to j.w[0] if i.w[0] != 0
    Branch Indirect on Zero (Half-word)
    bihz  i, j ; branch to j.w[0] if i.h[1] == 0
    Branch Indirect on Not Zero (Half-word)
    bihnz i, j ; branch to j.w[0] if i.h[1] != 0

    Note: These instructions can enable/disable interrupts as well.

    BR: Interrupt & Misc

    Interrupt RETurn
    iret   i    ; Return from interrupt
    Interrupt RETurn
    iretd  i    ; Return from interrupt, disable interrupts
    Interrupt RETurn
    irete  i    ; Return from interrupt, enable interrupts
    Branch Indirect and Set Link if External Data
    bisled i, j ; gosub j if channel 0 is non-zero

    Hints Branch Class (HB) [Odd:15]

    If you know the most likely (or only) outcome for a branch, you can make sure the branch is penalty free as long as the hint occurs at least 15 cycles before the branch is taken. If the hint occurs later, there still may be a benefit, since the penalty is lowered. However, if the hint arrives less than 4 cycles before the branch, there is no benefit.

    Please note that there is also a hardware bug w.r.t. the hbr instructions: one cannot hint a branch whose target is forward of the branch and also within the same 64-byte block as the branch.

    Hints Branch Instructions

    Hint Branch (Immediate)
    hbr  brFrom, j    ; branch hint for any BIxxx type branch
    Hint Branch Absolute
    hbra brFrom, brTo ; branch hint for any BRAxxx type branch
    Hint Branch Relative
    hbrr brFrom, brTo ; branch hint for any BRxxx type branch
    Hint Branch Prefetch
    hbrp              ; inline prefetch code (*)

    (*) allows 15 LS instructions in a row without any instruction fetch stall.

    DP: Double Precision

    DP instructions have a latency of 13 cycles and are even. However, they will stall pipelining for 6 cycles (that is, all currently executing instructions are halted) while this instruction is executed. Therefore, we do not recommend using double precision at all!

    Questions?

    That's all folks!
