Page 1: Intel’s MMX

Intel’s MMX

Dr. Richard Enbody

CSE 820

Page 2: Intel’s MMX

Michigan State University, Computer Science and Engineering

Why MMX?

Make the Common Case Fast

• Multimedia and Communication consume significant computing resources.

• Providing specific hardware support makes sense.

Page 3: Intel’s MMX

Goals

• Accelerate multimedia and communications applications.

• Maintain full compatibility with existing operating systems and applications.

• Exploit the inherent parallelism in multimedia and communication algorithms.

• Add new instructions and data types to improve performance.

Page 4: Intel’s MMX

First Step: examine code

• Examined a wide range of applications: graphics, MPEG video, music synthesis, speech compression, speech recognition, image processing, games, video conferencing.

• Identified and analyzed the most compute-intensive routines

Page 5: Intel’s MMX

Common Characteristics

• Small integer data types: e.g. 8-bit pixels, 16-bit audio samples

• Small, highly repetitive loops

• Frequent multiply-and-accumulate

• Compute-intensive algorithms

• Highly parallel operations

Page 6: Intel’s MMX

MMX Technology

A set of basic, general-purpose integer instructions:

• Single Instruction, Multiple Data (SIMD)

• 57 new instructions

• Eight 64-bit wide MMX registers

• Four new data types
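
The four data types are not spelled out on the surviving slides. In Intel's documentation they are packed byte (eight 8-bit lanes), packed word (four 16-bit lanes), packed doubleword (two 32-bit lanes), and quadword (one 64-bit value). The sketch below models a 64-bit MMX register as a plain Python integer, which is an assumption made for illustration; `unpack_lanes` and `pack_lanes` are hypothetical helper names, not MMX instructions.

```python
# Model of a 64-bit MMX register as a Python int (illustrative assumption).
MASK64 = (1 << 64) - 1

# Lane width in bits for each of the four MMX data types.
LANE_BITS = {
    "packed_byte": 8,
    "packed_word": 16,
    "packed_doubleword": 32,
    "quadword": 64,
}

def unpack_lanes(reg, bits):
    """Split a 64-bit value into unsigned lanes, least significant lane first."""
    lane_mask = (1 << bits) - 1
    return [(reg >> shift) & lane_mask for shift in range(0, 64, bits)]

def pack_lanes(lanes, bits):
    """Combine lanes (least significant first) into one 64-bit value."""
    lane_mask = (1 << bits) - 1
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & lane_mask) << (i * bits)
    return reg & MASK64

reg = 0x0807060504030201
for name, bits in LANE_BITS.items():
    # The same 64 bits viewed as 8, 4, 2, or 1 lanes.
    print(name, [hex(v) for v in unpack_lanes(reg, bits)])
```

The same 64 bits are reinterpreted by each instruction; the register itself carries no type tag.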

Page 7: Intel’s MMX

Data Types

Page 8: Intel’s MMX

Data Types

Page 9: Intel’s MMX

Example

• Pixels are generally 8-bit integers. Pack eight pixels into a 64-bit MMX register.

• An MMX instruction takes all eight of the pixels at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register.
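
A minimal sketch of the idea, assuming a 64-bit Python integer stands in for the MMX register. The helper names `pack_pixels` and `unpack_pixels` are hypothetical. One 64-bit XOR plays the role of a single logical MMX instruction that inverts all eight pixels at once.

```python
def pack_pixels(pixels):
    """Pack eight 8-bit pixels into one 64-bit integer (pixel 0 in the low byte)."""
    assert len(pixels) == 8
    reg = 0
    for i, p in enumerate(pixels):
        reg |= (p & 0xFF) << (8 * i)
    return reg

def unpack_pixels(reg):
    """Split a 64-bit integer back into eight 8-bit pixels."""
    return [(reg >> (8 * i)) & 0xFF for i in range(8)]

pixels = [0, 16, 32, 64, 128, 200, 250, 255]
reg = pack_pixels(pixels)

# A single 64-bit XOR with all ones inverts every pixel in parallel.
negative = reg ^ 0xFFFFFFFFFFFFFFFF
print(unpack_pixels(negative))  # [255, 239, 223, 191, 127, 55, 5, 0]
```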

Page 10: Intel’s MMX

Compatibility

• No new exceptions or states are added.

• Aliases to existing FP registers: when an MMX instruction writes a register, the exponent field of the corresponding floating-point register (bits 64-78) and the sign bit (bit 79) are set to ones (1's), making the value in the register a NaN (Not a Number) or infinity when viewed as a floating-point value.
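
The aliasing can be sketched as bit manipulation on an 80-bit value. This assumes the layout stated above (64-bit significand in bits 0-63, exponent in bits 64-78, sign in bit 79); the function names are hypothetical.

```python
def x87_view(mmx_value):
    """80-bit pattern seen through the aliased FP register after an MMX write:
    the 64-bit MMX value in bits 0-63, with bits 64-79 all set to one."""
    return (0xFFFF << 64) | (mmx_value & 0xFFFFFFFFFFFFFFFF)

def decode_x87(bits80):
    """Split an 80-bit pattern into (sign, exponent, significand)."""
    sign = (bits80 >> 79) & 1
    exponent = (bits80 >> 64) & 0x7FFF
    significand = bits80 & 0xFFFFFFFFFFFFFFFF
    return sign, exponent, significand

def is_nan_or_infinity(bits80):
    """An all-ones exponent never encodes an ordinary finite number."""
    _, exponent, _ = decode_x87(bits80)
    return exponent == 0x7FFF

view = x87_view(0x0000000000001234)
print(hex(view), is_nan_or_infinity(view))
```

Because of this, floating-point code that accidentally reads a register holding MMX data sees an obviously invalid value rather than a plausible number.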

Page 11: Intel’s MMX

Page 12: Intel’s MMX

57 Instructions

• Basic arithmetic: add, subtract, multiply, arithmetic shift and multiply-add

• Comparison

• Conversion: pack & unpack

• Logical

• Shift

• Move: register-to-register

• Load/Store: 64-bit and 32-bit

Page 13: Intel’s MMX

Packed Add Word with wrap around

• Each addition is independent

• Rightmost overflows and wraps around
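
A sketch of packed word addition with wrap-around, modelling the register as a 64-bit Python integer. The function is named `paddw` after the instruction mnemonic; the input values are made up so that only the rightmost lane overflows.

```python
def paddw(a, b):
    """Add four independent 16-bit lanes; a carry out of a lane is discarded."""
    result = 0
    for shift in range(0, 64, 16):
        lane_a = (a >> shift) & 0xFFFF
        lane_b = (b >> shift) & 0xFFFF
        # Masking to 16 bits drops the carry, so the lane wraps around
        # and never disturbs its neighbour.
        result |= ((lane_a + lane_b) & 0xFFFF) << shift
    return result

a = 0x0001_0002_0003_FFFF
b = 0x0010_0020_0030_0002
print(hex(paddw(a, b)))  # rightmost lane: 0xFFFF + 0x0002 wraps to 0x0001
```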

Page 14: Intel’s MMX

Saturation

• Saturation: if addition results in overflow or underflow, the result is clamped to the largest or smallest value representable.

• This is important for pixel calculations where this would prevent a wrap-around add from causing a black pixel to suddenly turn white
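
The difference between wrapping and clamping can be sketched on single 8-bit pixel values. The function names here are hypothetical; the ranges are the standard unsigned 8-bit (0 to 255) and signed 16-bit (-32768 to 32767) limits.

```python
def add_u8_wrap(x, y):
    """Unsigned 8-bit add with wrap-around."""
    return (x + y) & 0xFF

def add_u8_saturate(x, y):
    """Unsigned 8-bit add clamped to 255."""
    return min(x + y, 0xFF)

def sub_u8_wrap(x, y):
    """Unsigned 8-bit subtract with wrap-around."""
    return (x - y) & 0xFF

def sub_u8_saturate(x, y):
    """Unsigned 8-bit subtract clamped to 0."""
    return max(x - y, 0)

def add_s16_saturate(x, y):
    """Signed 16-bit add clamped to [-32768, 32767]."""
    return max(-32768, min(32767, x + y))

# Brightening a bright pixel: wrap-around makes it dark, saturation keeps it white.
print(add_u8_wrap(200, 100), add_u8_saturate(200, 100))  # 44 255
# Darkening a nearly black pixel: wrap-around makes it nearly white.
print(sub_u8_wrap(5, 10), sub_u8_saturate(5, 10))        # 251 0
```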

Page 15: Intel’s MMX

No Mode

There is no "saturation mode bit": a new mode bit would require a change to the operating system. Separate instructions are used to generate wrap-around and saturating results.

Page 16: Intel’s MMX

Packed Add Word with unsigned saturation

• Each addition is independent

• Rightmost saturates
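
A sketch of packed word addition with unsigned saturation, again modelling the register as a 64-bit Python integer. The function is named `paddusw` after the instruction mnemonic; the inputs are made up so that only the rightmost lane overflows.

```python
def paddusw(a, b):
    """Add four independent unsigned 16-bit lanes, clamping each lane to 0xFFFF."""
    result = 0
    for shift in range(0, 64, 16):
        total = ((a >> shift) & 0xFFFF) + ((b >> shift) & 0xFFFF)
        # Clamp instead of dropping the carry.
        result |= min(total, 0xFFFF) << shift
    return result

a = 0x0001_0002_0003_FFFF
b = 0x0010_0020_0030_0002
print(hex(paddusw(a, b)))  # rightmost lane: 0xFFFF + 0x0002 saturates to 0xFFFF
```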

Page 17: Intel’s MMX

Multiply-Accumulate

Multiply-accumulate operations are fundamental to many signal processing algorithms such as vector dot products, matrix multiplies, FIR and IIR filters, FFTs, DCTs, etc.

Page 18: Intel’s MMX

Packed Multiply-Add

Multiply four pairs of signed 16-bit words, generating four 32-bit products. Add the 2 products on the left for one result and the 2 products on the right for the other result.
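
A sketch of the packed multiply-add, modelling registers as 64-bit Python integers. `pmaddwd` follows the instruction mnemonic; `to_signed` and `pack_words` are hypothetical helpers added for illustration.

```python
def to_signed(value, bits):
    """Interpret an unsigned lane value as a two's-complement signed integer."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def pack_words(values):
    """Pack four 16-bit values into a 64-bit integer, first value in the low lane."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFFFF) << (16 * i)
    return reg

def pmaddwd(a, b):
    """Four signed 16x16 multiplies, then pairwise adds into two 32-bit lanes."""
    products = []
    for shift in range(0, 64, 16):
        wa = to_signed((a >> shift) & 0xFFFF, 16)
        wb = to_signed((b >> shift) & 0xFFFF, 16)
        products.append(wa * wb)
    low = (products[0] + products[1]) & 0xFFFFFFFF
    high = (products[2] + products[3]) & 0xFFFFFFFF
    return (high << 32) | low

result = pmaddwd(pack_words([1, 2, 3, 4]), pack_words([10, 20, 30, -1]))
# Low lane: 1*10 + 2*20 = 50; high lane: 3*30 + 4*(-1) = 86.
print(to_signed(result & 0xFFFFFFFF, 32), to_signed(result >> 32, 32))
```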

Page 19: Intel’s MMX

Packed Parallel Compare

• No new condition code flags

• No existing IA condition code flags are affected by this instruction.

• Result can be used as a mask to select elements from different inputs using a logical operation, eliminating branches.
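
A sketch of compare-then-select on four signed 16-bit lanes. `pcmpgtw` follows the instruction mnemonic; `select` and the packing helpers are hypothetical. The Python `if` only models what the hardware does per lane; the select step itself is pure AND/OR logic with no branch.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed16(value):
    """Interpret a 16-bit lane as a two's-complement signed integer."""
    return (value & 0x7FFF) - (value & 0x8000)

def pack_words(values):
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFFFF) << (16 * i)
    return reg

def unpack_words(reg):
    return [to_signed16((reg >> (16 * i)) & 0xFFFF) for i in range(4)]

def pcmpgtw(a, b):
    """Each lane becomes 0xFFFF where a > b (signed), else 0x0000. No flags are set."""
    mask = 0
    for shift in range(0, 64, 16):
        if to_signed16((a >> shift) & 0xFFFF) > to_signed16((b >> shift) & 0xFFFF):
            mask |= 0xFFFF << shift
    return mask

def select(mask, a, b):
    """Take lanes of a where the mask is ones and lanes of b elsewhere."""
    return (a & mask) | (b & ~mask & MASK64)

a = pack_words([5, -3, 100, 7])
b = pack_words([2, 4, 100, 9])
mask = pcmpgtw(a, b)
print(unpack_words(select(mask, a, b)))  # lane-wise maximum: [5, 4, 100, 9]
```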

Page 20: Intel’s MMX

Packed Parallel Compare

Page 21: Intel’s MMX

Pack/Unpack

• Important when an algorithm needs higher precision in its intermediate calculations, as in image filtering.

• For example, image filtering involves a set of intermediate multiply operations between filter coefficients and a set of adjacent image pixels, accumulating all the values together.
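
A sketch of the widen-then-narrow pattern: bytes are unpacked to 16-bit words so intermediate results have headroom, then packed back to bytes with saturation. The function names echo the PUNPCKLBW, PUNPCKHBW and PACKUSWB mnemonics, but this is a simplified model (unpacking against zero only), not a full description of those instructions.

```python
def to_signed16(value):
    return (value & 0x7FFF) - (value & 0x8000)

def pack_words(values):
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFFFF) << (16 * i)
    return reg

def punpcklbw_zero(reg):
    """Zero-extend the four low bytes into four 16-bit words."""
    out = 0
    for i in range(4):
        out |= ((reg >> (8 * i)) & 0xFF) << (16 * i)
    return out

def punpckhbw_zero(reg):
    """Zero-extend the four high bytes into four 16-bit words."""
    out = 0
    for i in range(4):
        out |= ((reg >> (8 * (i + 4))) & 0xFF) << (16 * i)
    return out

def packuswb(low_words, high_words):
    """Narrow eight signed words to eight bytes, clamping each to 0..255."""
    out = 0
    for half, reg in enumerate((low_words, high_words)):
        for i in range(4):
            word = to_signed16((reg >> (16 * i)) & 0xFFFF)
            out |= max(0, min(255, word)) << (8 * (4 * half + i))
    return out

pixels = 0x0807060504030201
lo, hi = punpcklbw_zero(pixels), punpckhbw_zero(pixels)
print(hex(lo), hex(hi), hex(packuswb(lo, hi)))
```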

Page 22: Intel’s MMX

Pack

Page 23: Intel’s MMX

Conditional Select

The Chroma Keying example demonstrates how conditional selection using the MMX instruction set removes branch mispredictions, in addition to performing multiple selection operations in parallel. Text overlay on a picture/video background and sprite overlays in games are some of the other operations that would benefit from this technique.

Page 24: Intel’s MMX

Chroma Keying

Page 25: Intel’s MMX

Chroma Keying (cont’d)

• Take pixels from the picture with the woman on a green background.

• A compare instruction builds a mask for that data. That mask is a sequence of bytes that are all ones or all zeros.

• We now know what is the unwanted background and what we want to keep.

Page 26: Intel’s MMX

Create Mask

Assume pixels alternate green/not_green

Page 27: Intel’s MMX

Combine: !AND, AND, OR
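
The mask-and-combine steps can be sketched on eight 8-bit pixels at once. This assumes each pixel is a single byte (for example a palette index) and that `GREEN = 0x2A` is the key colour; both are illustrative assumptions, as are the helper names. The foreground alternates green/not-green as on the Create Mask slide.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
GREEN = 0x2A  # hypothetical key colour value

def pack_bytes(values):
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFF) << (8 * i)
    return reg

def unpack_bytes(reg):
    return [(reg >> (8 * i)) & 0xFF for i in range(8)]

def pcmpeqb(a, b):
    """Each byte lane becomes 0xFF where the bytes are equal, else 0x00."""
    mask = 0
    for shift in range(0, 64, 8):
        if ((a >> shift) & 0xFF) == ((b >> shift) & 0xFF):
            mask |= 0xFF << shift
    return mask

def chroma_key(foreground, background, key):
    mask = pcmpeqb(foreground, pack_bytes([key] * 8))  # ones where the pixel is green
    keep_fg = ~mask & MASK64 & foreground              # !AND: clear the green pixels
    take_bg = mask & background                        # AND: background where green was
    return keep_fg | take_bg                           # OR: merge the two pictures

fg = pack_bytes([GREEN, 10, GREEN, 20, GREEN, 30, GREEN, 40])
bg = pack_bytes([1, 2, 3, 4, 5, 6, 7, 8])
print(unpack_bytes(chroma_key(fg, bg, GREEN)))  # [1, 10, 3, 20, 5, 30, 7, 40]
```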

Page 28: Intel’s MMX

Branch Removal

Without MMX technology, each pixel is processed separately and requires a conditional branch. Using MMX instructions, eight 8-bit pixels can be processed in parallel and no conditional branches are involved.

Page 29: Intel’s MMX

Vector Dot Product

• The vector dot product is one of the most basic algorithms used in signal processing of natural data such as images, audio and video.

• PMADD does 4 multiplies and 2 adds at a time. Coupled with PADD, eight multiply-accumulate operations can be performed: 2 PMADD and 2 PADD.

Page 30: Intel’s MMX

Vector Dot Product

Page 31: Intel’s MMX

Vector Dot Product

Page 32: Intel’s MMX

Vector Dot Product

Assuming precision is sufficient, a dot-product on an 8-element vector can be completed using 8 MMX instructions: 2 PMADDs, 2 PADDs, two shifts (if needed to fix the precision after the multiply), and 2 loads for one of the vectors (the other vector is loaded by the PMADD instruction which can have one of its operands come from memory).
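
One plausible arrangement of the 2 PMADDs and 2 PADDs is sketched below for 16-bit elements, with registers modelled as 64-bit Python integers. The first add combines the two multiply-add results and the second folds the upper 32-bit half onto the lower. The precision-fixing shifts are left out, and the slides do not confirm this exact ordering, so treat it as an assumption.

```python
M32 = 0xFFFFFFFF

def to_signed(value, bits):
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def pack_words(values):
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFFFF) << (16 * i)
    return reg

def pmaddwd(a, b):
    """Four signed 16x16 multiplies, summed pairwise into two 32-bit lanes."""
    p = [to_signed((a >> s) & 0xFFFF, 16) * to_signed((b >> s) & 0xFFFF, 16)
         for s in range(0, 64, 16)]
    return (((p[2] + p[3]) & M32) << 32) | ((p[0] + p[1]) & M32)

def paddd(a, b):
    """Add two independent 32-bit lanes with wrap-around."""
    low = ((a & M32) + (b & M32)) & M32
    high = ((a >> 32) + (b >> 32)) & M32
    return (high << 32) | low

def dot8(x, y):
    """Dot product of two 8-element vectors of signed 16-bit integers."""
    partial = paddd(pmaddwd(pack_words(x[:4]), pack_words(y[:4])),
                    pmaddwd(pack_words(x[4:]), pack_words(y[4:])))
    total = paddd(partial, partial >> 32)  # fold the upper lane onto the lower
    return to_signed(total & M32, 32)

print(dot8([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))  # 120
```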

Page 33: Intel’s MMX

Compare

            W/O MMX   WITH MMX
Load           16         4
Multiply        8         2
Shift           8         2
Add             7         1
Misc           --         3
Store           1         1
Total          40        13

Page 34: Intel’s MMX

Compare

• With MMX technology, one third of the number of instructions is needed.

• Most MMX instructions can be executed in one clock cycle, so the performance improvement will be more dramatic than the simple ratio of instruction counts.

Page 35: Intel’s MMX

Matrix Multiply

3D games: computations that manipulate 3D objects use 4-by-4 matrices that are multiplied with 4-element vectors many times. Each vector has the X, Y, Z and perspective-correction information for each pixel. The 4-by-4 matrix is used to rotate, scale, translate and update the perspective-correction information for each pixel.
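
A sketch of the 4-by-4 matrix times 4-element vector product, assuming 16-bit integer (fixed-point) elements. Each output element is one row dotted with the vector: one packed multiply-add gives two partial sums, which are then added. The helper names and the example translation matrix are illustrative assumptions.

```python
M32 = 0xFFFFFFFF

def to_signed(value, bits):
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def pack_words(values):
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFFFF) << (16 * i)
    return reg

def pmaddwd(a, b):
    """Four signed 16x16 multiplies, summed pairwise into two 32-bit lanes."""
    p = [to_signed((a >> s) & 0xFFFF, 16) * to_signed((b >> s) & 0xFFFF, 16)
         for s in range(0, 64, 16)]
    return (((p[2] + p[3]) & M32) << 32) | ((p[0] + p[1]) & M32)

def mat4_vec4(matrix, vector):
    """Multiply a 4x4 matrix (list of rows) by a 4-element vector."""
    v = pack_words(vector)
    out = []
    for row in matrix:
        pair = pmaddwd(pack_words(row), v)        # two partial sums per row
        out.append(to_signed(((pair & M32) + (pair >> 32)) & M32, 32))
    return out

# Translate the point (10, 20, 30) by (5, -2, 3); the fourth component is 1.
translate = [[1, 0, 0, 5],
             [0, 1, 0, -2],
             [0, 0, 1, 3],
             [0, 0, 0, 1]]
print(mat4_vec4(translate, [10, 20, 30, 1]))  # [15, 18, 33, 1]
```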

Page 36: Intel’s MMX

Page 37: Intel’s MMX

Compare

            W/O MMX   WITH MMX
Load           32         6
Multiply       16         4
Add            12         2
Misc            8        12
Store           4         4
Total          72        28

Page 38: Intel’s MMX

Matrix Multiply

• MMX required fewer than half the instructions (28 versus 72).

Page 39: Intel’s MMX

Image Dissolve Using Alpha Blending

• Dissolve a Swan into a Flower

Result_pixel = Flower_pixel * (alpha/255) + Swan_pixel * [1 - (alpha/255)]

• Assume 640x480 resolution
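
The blend formula can be sketched in integer arithmetic on eight pixels at a time. Each byte is widened to a 16-bit lane, multiplied, added, scaled back and narrowed, which mirrors the unpack, multiply, add and pack rows of an MMX implementation. The function name `blend8` is hypothetical, and the exact division by 255 is a simplification; real code would more likely approximate it with a shift.

```python
def blend8(flower, swan, alpha):
    """Blend eight 8-bit pixels; alpha runs from 0 (all swan) to 255 (all flower)."""
    assert len(flower) == len(swan) == 8 and 0 <= alpha <= 255
    # Unpack: widen each byte to a 16-bit lane so the products fit.
    f_words = [p & 0xFF for p in flower]
    s_words = [p & 0xFF for p in swan]
    # Multiply: each product is at most 255 * 255 = 65025, which fits in 16 bits.
    prod_f = [(f * alpha) & 0xFFFF for f in f_words]
    prod_s = [(s * (255 - alpha)) & 0xFFFF for s in s_words]
    # Add: the weights sum to 255, so the sum also fits in 16 bits.
    sums = [(a + b) & 0xFFFF for a, b in zip(prod_f, prod_s)]
    # Scale back to 0..255 and pack to bytes with saturation.
    return [min(255, t // 255) for t in sums]

print(blend8([200] * 8, [100] * 8, 128))  # [150, 150, 150, 150, 150, 150, 150, 150]
```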

Page 40: Intel’s MMX

Dissolve: Millions of Inst.

            W/O MMX   WITH MMX
Load          470       117
Unpack         --       117
Multiply      470       117
Add           235        58
Pack           --        58
Store         235        58
Total       1,400       525

Page 41: Intel’s MMX

Dissolve

Nearly 1 billion fewer instructions (about 1,400 million versus 525 million) for the 640x480 dissolve

Page 42: Intel’s MMX

Page 43: Intel’s MMX

Conclusion

• MMX appeared in 1997 in Pentium processors (with a bigger cache).

• According to Intel, an MMX microprocessor runs a multimedia application up to 60% faster. In addition, it runs other applications about 10% faster.