ECE 1773- Spring ‘02 Some material © Hill, Sohi, Smith, Wood (UW-Madison) © A. Moshovos...


Multimedia ISA Extensions

• Intel’s MMX
  – The Basics
  – Instruction Set
  – Examples
  – Integration into Pentium
  – Relationship to vector ISAs
• AMD’s 3DNow!
• Intel’s ISSE (a.k.a. KNI)


MMX: Basics

• Multimedia applications are becoming popular
• Are current ISAs a good match for them?
• Methodology:
  – Consider a number of “typical” applications
  – Can we do better?
  – Cost vs. performance vs. utility tradeoffs
• Net Result: Intel’s MMX
• Can also be viewed as an attempt to maintain market share
  – If people are going to use these kinds of applications, we had better support them


Multimedia Applications

• Most multimedia apps have lots of parallelism:
  – for i = here to infinity
    • out[i] = in_a[i] * in_b[i]
  – At runtime:
    • out[0] = in_a[0] * in_b[0]
    • out[1] = in_a[1] * in_b[1]
    • out[2] = in_a[2] * in_b[2]
    • out[3] = in_a[3] * in_b[3]
    • …
• Also, work on short integers:
  – in_a[i] is 0 to 255, for example (8-bit color)
  – or 0 to 64K (16-bit audio)


Observations

• 32-bit registers are wasted
  – only using part of them, and we know it
  – ALUs underutilized, and we know it
• Instruction specification is inefficient
  – even though we know that many identical operations will be performed, we still have to specify each of them individually
  – Instruction bandwidth
  – Discovering Parallelism
  – Memory Ports?
    • Could read four elements of an array with one 32-bit load
    • Same for stores
    • The hardware will have a hard time discovering this
      – Coalescing and dependences
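
The memory-port observation can be sketched in plain C. This is only an illustration, not MMX code: the helper names `load4_u8` and `add4_u8_wrap` are hypothetical, and the bit-masking trick stands in for what a packed add does in hardware.

```c
#include <stdint.h>
#include <string.h>

/* One 32-bit load fetches four adjacent 8-bit elements.
   memcpy avoids alignment and aliasing problems. */
static uint32_t load4_u8(const uint8_t *p) {
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Add four packed 8-bit lanes with wraparound, keeping carries
   from crossing lane boundaries. */
static uint32_t add4_u8_wrap(uint32_t a, uint32_t b) {
    /* Add the low 7 bits of every lane; no carry can leave a lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of every lane without propagating a carry. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

A scalar ISA gives the hardware no hint that the four byte operations are independent, which is why the software has to resort to masking here.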


MMX Contd.

• Can do better than traditional ISA
  – new data types
  – new instructions
• Pack data in 64-bit words
  – bytes
  – “words” (16 bits)
  – “double words” (32 bits)
• Operate on packed data like short vectors (arrays)
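
A minimal sketch of the packed data types, assuming a plain C union as a stand-in for a 64-bit MMX register. The type name `packed64` and the helpers `padd_b` and `padd_w` are hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* One 64-bit quantity viewed as 8 bytes, 4 words, or 2 double words. */
typedef union {
    uint64_t q;
    uint32_t d[2];
    uint16_t w[4];
    uint8_t  b[8];
} packed64;

/* Element-wise byte add with wraparound: 8 additions per call. */
static packed64 padd_b(packed64 x, packed64 y) {
    packed64 r;
    for (int i = 0; i < 8; i++)
        r.b[i] = (uint8_t)(x.b[i] + y.b[i]);
    return r;
}

/* Element-wise word add with wraparound: 4 additions per call. */
static packed64 padd_w(packed64 x, packed64 y) {
    packed64 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = (uint16_t)(x.w[i] + y.w[i]);
    return r;
}
```

The loops here are what a single packed instruction performs in one step.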


MMX:Example

• Up to 8 operations (64 bits) go in parallel
• Potential improvement: 8x
• In practice less, but still good
• Besides, another reason to think your machine is obsolete


MMX Data Types


A bit of History

• This is a special case of SIMD
  – Single Instruction
  – Multiple Data
• One instruction specifies that an operation should be applied:
  – Repeatedly
  – To possibly different data elements each time
  – Each of these operations is independent
• Conventional ISA is SISD
  – Single Instruction/Single Data
• First used in Livermore S-1 (> 25 years ago)


MMX: Instruction Set

• 57 new instructions
• Integer Arithmetic
  – add/sub/mul
  – multiply add
  – signed/unsigned
  – saturating/wraparound
• Shifts
• Compare (form mask)
• Pack/Unpack
• Move
  – from/to memory
  – from/to registers


Arithmetic

• Conventional: Wrap-around
  – on overflow, wrap around to MININT
  – on underflow, wrap around to MAXINT
• Think of digital audio
  – What happens when you turn volume to the MAX?
• Brightness in pictures
• Saturating arithmetic:
  – on overflow, stay at MAXINT
  – on underflow, stay at MININT
• Two flavors:
  – unsigned
  – signed
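
The difference between wraparound and saturation can be sketched for a single 8-bit element. The function names are hypothetical; a packed saturating add applies the same rule to every element of the 64-bit word at once.

```c
#include <stdint.h>

/* Wraparound: the result is reduced modulo 256. */
static uint8_t add_u8_wrap(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Unsigned saturation: clamp at 255 instead of wrapping. */
static uint8_t add_u8_sat(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > 255u ? 255u : (uint8_t)sum;
}

/* Signed saturation: clamp at INT8_MAX and INT8_MIN. */
static int8_t add_s8_sat(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

With wraparound, brightening a pixel of value 200 by 100 yields 44 (nearly black); with saturation it stays at 255 (white), which is what an image or audio application wants.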


Operations

• Mult/Add

• Compares

• Conversion

  – Interpolation/Transpose
  – Unpack (e.g., byte to word)
  – Pack (e.g., word to byte)


Examples

• Image Compositing
  – A and B images fade-in and fade-out
  – A * fade + B * (1 - fade), OR
  – (A - B) * fade + B
• Image Overlay
  – Sprite: e.g., mouse cursor
  – Sprite: normal colors + transparent
  – for i = 1 to Sprite_Length
    • if A[i] = clear_color then
      – Out_frame[i] = C[i]
      – else Out_frame[i] = A[i]
• Matrix Transpose
  – Convert from row major to column major
  – Used in JPEG
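
The compositing formula (A - B) * fade + B can be sketched in scalar C as follows. The fixed-point convention (fade in the range 0 to 256, with 256 meaning "all A") and the name `crossfade_u8` are assumptions made for this sketch, not something the slides specify.

```c
#include <stddef.h>
#include <stdint.h>

/* out[i] = (a[i] - b[i]) * fade + b[i], with fade scaled by 256.
   fade == 0 gives image B, fade == 256 gives image A. */
static void crossfade_u8(const uint8_t *a, const uint8_t *b,
                         uint8_t *out, size_t n, int fade) {
    for (size_t i = 0; i < n; i++) {
        int diff = (int)a[i] - (int)b[i];   /* may be negative */
        out[i] = (uint8_t)(b[i] + (diff * fade) / 256);
    }
}
```

The second form of the formula needs one multiply per pixel instead of two, and every iteration is independent of the others, so the loop maps naturally onto packed subtract, multiply, and add.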


Matrix Transpose 4x4

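Input matrix (one row per 64-bit register, four 16-bit words each):

m03 m02 m01 m00
m13 m12 m11 m10
m23 m22 m21 m20
m33 m32 m31 m30

Step 1: interleave the low words of two rows with punpcklwd

m33 m32 m31 m30     m13 m12 m11 m10
m23 m22 m21 m20     m03 m02 m01 m00
   punpcklwd           punpcklwd
m31 m21 m30 m20     m11 m01 m10 m00

Step 2: combine the double words of the two results

   punpckhdq           punpckldq
m31 m21 m11 m01     m30 m20 m10 m00

• That’s for the first two rows

Transposed matrix:

m30 m20 m10 m00
m31 m21 m11 m01
m32 m22 m12 m02
m33 m23 m13 m03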
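
The unpack sequence can be checked with a plain C emulation that models each 64-bit register as a `uint64_t`, with word 0 in the least significant bits. This is a sketch, not real MMX intrinsics: the function names only mirror the instruction mnemonics, and `word` and `transpose4x4` are hypothetical helpers. The last two output rows use punpckhwd, which the slide does not show.

```c
#include <stdint.h>

/* Extract 16-bit word i (0 = least significant) of a 64-bit register. */
static uint64_t word(uint64_t r, int i) {
    return (r >> (16 * i)) & 0xFFFFu;
}

/* Interleave the two low words of d and s: result = s1 d1 s0 d0. */
static uint64_t punpcklwd(uint64_t d, uint64_t s) {
    return word(d, 0) | (word(s, 0) << 16) | (word(d, 1) << 32) | (word(s, 1) << 48);
}

/* Interleave the two high words of d and s: result = s3 d3 s2 d2. */
static uint64_t punpckhwd(uint64_t d, uint64_t s) {
    return word(d, 2) | (word(s, 2) << 16) | (word(d, 3) << 32) | (word(s, 3) << 48);
}

/* Low double word of d in the low half, low double word of s in the high half. */
static uint64_t punpckldq(uint64_t d, uint64_t s) {
    return (d & 0xFFFFFFFFull) | (s << 32);
}

/* High double word of d in the low half, high double word of s in the high half. */
static uint64_t punpckhdq(uint64_t d, uint64_t s) {
    return (d >> 32) | (s & 0xFFFFFFFF00000000ull);
}

/* in[r] holds row r; out[c] receives column c of the input. */
static void transpose4x4(const uint64_t in[4], uint64_t out[4]) {
    uint64_t lo01 = punpcklwd(in[0], in[1]);  /* m11 m01 m10 m00 */
    uint64_t lo23 = punpcklwd(in[2], in[3]);  /* m31 m21 m30 m20 */
    uint64_t hi01 = punpckhwd(in[0], in[1]);  /* m13 m03 m12 m02 */
    uint64_t hi23 = punpckhwd(in[2], in[3]);  /* m33 m23 m32 m22 */
    out[0] = punpckldq(lo01, lo23);           /* m30 m20 m10 m00 */
    out[1] = punpckhdq(lo01, lo23);           /* m31 m21 m11 m01 */
    out[2] = punpckldq(hi01, hi23);           /* m32 m22 m12 m02 */
    out[3] = punpckhdq(hi01, hi23);           /* m33 m23 m13 m03 */
}
```

Eight unpack operations transpose all sixteen elements, with no per-element loads or stores.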


Chroma Keying

• for (i = 0; i < image_size; i++)
  – if (x[i] == Blue) new_image[i] = y[i];
  – else new_image[i] = x[i];


Chroma Keying Code

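• Movq mm3, mem1
  – Load eight pixels from the person’s image
• Movq mm4, mem2
  – Load eight pixels from the background image
• Pcmpeqb mm1, mm3
  – mm1 is assumed to already hold eight copies of Blue; it becomes a byte mask, all ones where the person pixel is Blue
• Pand mm4, mm1
  – Keep background pixels where the mask is set
• Pandn mm1, mm3
  – Keep person pixels where the mask is clear
• Por mm4, mm1
  – Merge the two; mm4 holds the eight output pixels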
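
The same compare-and-select sequence, emulated in plain C on a 64-bit integer that holds eight 8-bit pixels. This is a sketch under two assumptions: mm1 starts out holding eight copies of the key color, and the helper names (`pcmpeqb`, `chroma_key8`) are illustrative rather than real intrinsics.

```c
#include <stdint.h>

/* Byte-wise compare: each result byte is 0xFF where a and b match, else 0x00. */
static uint64_t pcmpeqb(uint64_t a, uint64_t b) {
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t lane = 0xFFull << (8 * i);
        if ((a & lane) == (b & lane))
            mask |= lane;
    }
    return mask;
}

/* Replace every pixel of `person` that equals `key` with the background pixel. */
static uint64_t chroma_key8(uint64_t person, uint64_t background, uint8_t key) {
    uint64_t mm1 = 0x0101010101010101ull * key;  /* assumed: key broadcast to 8 bytes */
    uint64_t mm3 = person;
    uint64_t mm4 = background;
    mm1 = pcmpeqb(mm1, mm3);  /* mask: 0xFF where person == key   */
    mm4 = mm4 & mm1;          /* pand:  background where masked    */
    mm1 = ~mm1 & mm3;         /* pandn: person where not masked    */
    mm4 = mm4 | mm1;          /* por:   merged result              */
    return mm4;
}
```

The branch in the scalar loop turns into a mask; both alternatives are computed for all eight pixels and the mask selects between them.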


Integration into Pentium

• Major issue: OS compatibility
  – Create new registers?
  – Share registers with FP
    • Existing OSes will save/restore
• Use 64-bit datapaths
• Pipe capable of 2 MMX IPC
• Separate MEM and Execute stage


“Recent” Multimedia Extensions

• Intel MMX: integer arithmetic only
• New algorithms -> new needs
• Need for massive amounts of FP ops
• Solution? MMX-like ISA, but for FP, not only integer
• Example: AMD’s 3DNow!
  – New data type:
    • 2 packed single-precision FP
      – 2 x 32 bits
        » sign + exponent + significand
  – New instructions
  – Speedup potential: 2x


AMD’s 3DNow!

• 21 new instructions
• Average: motivated by MPEG
• Add, Sub, Reverse Sub, Mul
• Accumulate
  – (A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2)
• Comparison (create mask)
• Min, Max (pairwise)
• Reciprocal and SQRT
  – Approximation: 1st step and other steps
• Prefetch
• Integer from/to FP conversion
• All operate on packed FP data
  – (-1)^sign * 2^(exponent - 127) * 1.mantissa
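
A small C sketch of the two ideas on this slide: the accumulate operation on a pair of packed floats, and the decoding of one single-precision element into sign, exponent, and fraction. It assumes `float` is a 32-bit IEEE 754 single and handles normalized values only; `packed2f`, `pfacc`, and `single_value` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Two packed single-precision values: lo is the low 32 bits, hi the high 32 bits. */
typedef struct { float lo, hi; } packed2f;

/* Accumulate: low result = sum of a's elements, high result = sum of b's elements. */
static packed2f pfacc(packed2f a, packed2f b) {
    packed2f r = { a.lo + a.hi, b.lo + b.hi };
    return r;
}

/* Rebuild the value of a normalized single from its fields:
   (-1)^sign * 2^(exponent - 127) * 1.fraction */
static double single_value(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* assumes 32-bit IEEE 754 float */
    uint32_t sign = bits >> 31;
    int exponent = (int)((bits >> 23) & 0xFFu);
    uint32_t fraction = bits & 0x7FFFFFu;
    double v = 1.0 + (double)fraction / 8388608.0;   /* 8388608 = 2^23 */
    for (int e = exponent - 127; e > 0; e--) v *= 2.0;
    for (int e = exponent - 127; e < 0; e++) v /= 2.0;
    return sign ? -v : v;
}
```

Because each register holds only two elements, the best-case speedup over scalar FP is 2x.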


Recent Extensions Cont.

• Intel’s ISSE
  – very similar to AMD’s 3DNow!
  – But has separate registers
• Lessons?
  – Applications change over time
  – Careful when introducing new instructions
    • How useful are they?
    • Cost?
    • LEGACY: are they going to be useful in the future?
• Everyone has their own multimedia instruction set these days
  – read handout


Intel’s SSE

• Multimedia/Internet?
• 70 new instructions
• Major Types:
  – SIMD-FP: 128 bits wide, 4 x 32-bit FP
  – Data movement and re-organization
  – Type conversion
    • Int to FP and vice versa
    • Scalar/FP precision
  – State Save/Restore
    • New SSE registers, not like MMX
  – Memory Streaming
    • Prefetch to specified hierarchy level
  – New Media
    • Absolute Diff, Rounded AVG, MIN/MAX
• SSE2:
  – SIMD-FP: two 64-bit FP as 128 bits


Altivec (PowerPC Mmedia Ext)

• 128-bit registers
• 8, 16, or 32 bit data types
• Scalar or single-precision FP
• 162 Instructions
• Saturation or Modulo arithmetic
• Four operand Instructions
  – 3 sources, 1 target


Altivec Design Process

• Look at multimedia kernels
• Justify new instructions
• Video
  – 8-bit int LowQ, 16-bit int HighQ
• Audio
  – 16-bit int LowQ, SP FP HighQ
• Image Processing
  – 8-bit int LowQ, 16-bit int HighQ
• 3D Graphics
  – 16-bit int LowQ, SP FP HighQ
• Speech Recog.
  – 16-bit int LowQ, SP FP HighQ
• Communications/Crypto
  – 8-bit or 16-bit unsigned int


Vector Processors

• Vector:
  – One-dimensional array of numbers
• Original Motivation:
  – Scientific/numerical programs operate on vectors
• Parallelism abounds
• Example:
  – Do i = 1 to 64
    • C[i] = A[i] + B[i]
• Vector Processors
• Registers are vectors
• Operations are element-wise across multiple vectors
• Example:
  – addv Rc, Ra, Rb


Vector Example

• Do i = 1 to 64
  C[i] = A[i] + B[i]
• addv rc, ra, rb

c[0] = a[0] + b[0]
c[1] = a[1] + b[1]
c[2] = a[2] + b[2]
…
c[63] = a[63] + b[63]


Why Vector Processors?

• Deeper Pipelines -> Faster Clock -> Higher Performance
• BUT!
  – Interlock logic becomes really complicated as pipeline deepens
  – Bubbles due to data deps increase
• Want Wider Machines to exploit Parallelism
• BUT!
  – Increasingly harder to increase issue width
• Finally, recall Fetch and Issue Bottleneck
  – Can’t execute more than you fetch/decode


What’s Good About Vector Procs

• Vectors facilitate deeper Pipelines
  – No intra vector interlocks
  – No intra vector data deps
  – Inner loop control deps eliminated
    • They were artificial to start with
  – Single Instruction for Multiple operations
  – Vector instruction provides information for what the machine is going to be doing for a while
    • Could exploit in memory system
    • Know that we are going to use 64 elements which are likely one after the other


Vector Architectures

• Vectors in Memory
  – All vectors in memory
  – Long startup latency
  – Memory ports?
  – Good for long vectors
• Vectors in Registers
  – Load/store
  – Vector ops only on regs
  – Register ports less expensive than memory ports
  – Good for small vectors also
  – Vector register length is the limiter
• Fact: in most applications vectors are short
• Hence Register Vectors better


Vector ISA Example

• Vector-Vector Insts
  – VRC[i] = VRA[i] op VRB[i]
• Vector-Scalar Inst
  – VRB[i] = VRA[i] op CONST
• Vector Load/Store
  – Mem[i] = VRA[i]
  – W/ Stride
    • M[r1 + i * r2] = VRA[i]
  – Indexed
    • M[r1 + VRB[i]] = VRA[i]
    • Also called scatter/gather
• Support for shorter vectors
  – Vector Length Register
• Vector Masks
  – VRb[i] = op VRa[i] if (VRc[i])
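
The semantics of these instruction classes can be written out as scalar C loops. This is a behavioral sketch only: the `vreg` type, the 64-element maximum length, the function names, and the use of element (not byte) addresses are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MVL 64  /* assumed maximum vector length */

typedef struct { int32_t e[MVL]; } vreg;

/* Vector-vector: VRC[i] = VRA[i] + VRB[i], for the first vl elements only
   (vl plays the role of the vector length register). */
static void addv(vreg *vc, const vreg *va, const vreg *vb, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        vc->e[i] = va->e[i] + vb->e[i];
}

/* Strided store: M[base + i * stride] = VRA[i]. */
static void store_strided(int32_t *mem, size_t base, size_t stride,
                          const vreg *va, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        mem[base + i * stride] = va->e[i];
}

/* Indexed store (scatter): M[base + VRB[i]] = VRA[i]. */
static void store_indexed(int32_t *mem, size_t base, const vreg *vb,
                          const vreg *va, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        mem[base + (size_t)vb->e[i]] = va->e[i];
}

/* Masked operation: VRb[i] = -VRa[i] only where VRc[i] is nonzero. */
static void negv_masked(vreg *vb, const vreg *va, const vreg *vc, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (vc->e[i])
            vb->e[i] = -va->e[i];
}
```

Each function body corresponds to a single vector instruction; the loop bound comes from the vector length register rather than from a branch executed per element.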


Vector Chaining

• C[i] = A[i] * B[i]
• D[i] = C[i] + x
• MULTV VRC, VRA, VRB
• ADDVI VRD, VRC, Rx
• VRD[i] add can be initiated as soon as MULTV[i] finishes
• We do not have to wait for the whole MULTV to finish
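
The benefit of chaining can be estimated with a deliberately simplified timing model. The assumptions are not from the slides: each functional unit is fully pipelined, accepts one element per cycle, and has a fixed startup latency; memory and issue effects are ignored. The function names are hypothetical.

```c
/* Without chaining: the add starts only after all n products are done. */
static unsigned cycles_unchained(unsigned n, unsigned mul_latency,
                                 unsigned add_latency) {
    return (mul_latency + n) + (add_latency + n);
}

/* With chaining: each product flows into the adder as soon as it is ready,
   so the two pipelines overlap and the n elements are streamed once. */
static unsigned cycles_chained(unsigned n, unsigned mul_latency,
                               unsigned add_latency) {
    return mul_latency + add_latency + n;
}
```

Under this model, with 64 elements and illustrative latencies of 7 and 6 cycles, the pair of instructions takes 141 cycles unchained and 77 cycles chained; real machines differ in the details.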


Vector Processors – A bit of History

• CRAY-1: started in ’72, completed in ’74
• 12.5 ns cycle time
• 8 Scalar Registers
• 8 Address Registers
• 8 Vector Registers of 64 words each
• 64 Scalar and 64 Address temporaries
• 12 Functional Units
• 1 Mword memory: 4 clock cycles


Are Vectors Always a Win?

• From Gordon Bell’s talk

• Scalar is way better for short vectors
• Vector 7x Scalar for larger vectors

[Figure: time per element (y-axis) vs. vector size (x-axis)]


Cray-1 Architecture


Vectors and SIMD

• Vector Length
  – Not programmable (no VL reg)
  – Must be multiple of 64 total bits
• Memory Load/Store
  – stride one only
• Arithmetic
  – Integer only
• Conditionals
  – builds byte mask
  – do both ways and choose
  – no trap problem -- no trapping instructions
• Data Movement
  – minimal
  – only pack/unpack


Specifying Independence

• Vectors and SIMD are examples of “independence” ISAs
• Conventional ISA
  – One instruction after the other
  – No way of explicitly stating:
    • Inst A and B are independent
• Vectors and SIMD
  – A series of many identical conventional instructions becomes one vector or SIMD inst.
• Limited flexibility for specifying independence
• Still, these were optimized for the common case in a specific class of applications