Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.


Outline
  - Evolution of Intel multimedia extensions
      - x87 (386)
      - MMX (Pentium MMX, Pentium II)
      - SSE (Pentium III)
      - SSE2 (Pentium 4 Willamette)
      - SSE3 (Pentium 4 Prescott)
  - Hyper-Threading

x87 FPU
  - 8 80-bit data registers (double extended precision floating point)
  - Data registers treated as a stack
  - Control register: FP precision, rounding, ...
  - Status register: FPU busy, TOS, CC, error, exception, ...
  - Tag register (2 bits per data register): valid, zero, special, empty
  - Last instruction pointer register
  - Last data (operand) pointer register
  - Opcode register

x87 FPU State

x87 Data Types

x87 Instructions
  - Data transfer (load, store, move)
  - Basic arithmetic
  - Comparison
  - Transcendental (trigonometric, log, exp)
  - Load constant
  - x87 FPU control

MMX
  - SIMD execution
  - 8 64-bit data registers (MMX)
      - Aliased to x87 FPU registers
      - Randomly accessible

SIMD Execution

MMX State

MMX Registers

MMX Data Types

MMX Instructions
  - Data transfer
  - Arithmetic
  - Comparison
  - Conversion
  - Unpacking
  - Logical
  - Shift
  - Empty MMX state

SSE
  - Pentium III
  - 8 128-bit data registers (XMM)
      - Independent of x87 FPU and MMX registers
      - SSE instructions can be executed in parallel with MMX/x87
  - MXCSR register: control and status for XMM registers (similar to x87 status register)
  - EFLAGS register: results of compare ops
  - 128-bit packed single-precision fp data type
  - Prefetching, cacheability, store ordering control instructions

SSE State

XMM Registers

SSE Data Type

SSE Instructions
  - Packed and scalar single-precision floating point
  - Logical
  - Conversion
  - 64-bit SIMD integer
  - MXCSR management
  - State management
  - Cacheability control, prefetch, memory ordering
      - SFENCE (store fence)
  - FXSAVE, FXRSTOR: extend the fast save and restore of x87 and MMX registers to also save/restore the XMM and MXCSR registers

Packed Single-Precision FP Operation

Scalar Single-Precision FP Operation

Shuffle

Unpack and Interleave

SSE2
  - Pentium 4
  - More data types
  - More instructions to support new data types

SSE2 State

SSE2 Data Types

SSE2 Instructions
  - Support for additional types
  - CLFLUSH (cache line flush)
  - LFENCE (load fence)
  - MFENCE (load + store fence)

Packed Double-Precision FP Operations

Scalar Double-Precision FP Operations

SSE3
  - Pentium 4 (Prescott)
  - Support for Hyper-Threading
  - 13 new instructions
      - 10 SIMD support instructions
      - 1 x87 accelerating instruction (fp to int conversion)
      - Synchronization of threads
          - MONITOR (monitor write-back stores)
          - MWAIT (wait for write-back store)
  - No new state

Asymmetric Processing

Horizontal Data Movement

Hyper-Threading Terminology
  - Process
      - Program associated with a context (state: registers, program counter, flags, etc.)
      - Consists of one or more threads
  - Thread: lightweight process (less state)
  - Hyper-threading: single physical processor appears as 2 logical processors

Thread Level Parallelism (TLP)
  - Many applications have software threads that can be executed simultaneously
      - Online transaction processing
      - Web services
  - Latency can leave execution units idle
      - Cache misses
      - Branch mispredictions
      - Waiting for loads/stores

Techniques for Minimizing Effect of Long Latency
  - Chip multiprocessing (CMP)
      - 2 processors on single die
      - Larger than single core chip, manufacture more expensive
  - Time-slice or switch-on-event multithreading
      - Switch threads after fixed time period or on long latency events like cache misses
      - Doesn't take advantage of other sources of inefficient resource usage (branch mispredictions, instruction dependencies, etc.)
  - Simultaneous multithreading (SMT)
      - Multiple threads execute on single processor without switching
      - Hyper-Threading is Intel's implementation

Intel Hyper-Threading Demo

Resource Requirements for HT
  - Need to maintain 2 contexts
  - Replicated
      - Register renaming logic (RAT)
      - Instruction Pointer
      - ITLB
      - Return stack predictor
      - Various other architectural registers (GP, control, APIC, machine state)
  - Partitioned
      - Re-order buffers (ROBs)
      - Load/Store buffers
      - Various queues, like the scheduling queues, uop queue, etc.
  - Shared
      - Caches: trace cache, L1, L2, L3, microcode ROM
      - Microarchitectural registers
      - Execution Units

Hyper-Threading Goals
  - Minimize die area cost for implementing
  - Ensure forward progress by at least one logical processor
  - Maintain single-threaded performance

Frontend Changes
  - 2 PCs
  - Arbitration for shared resource access
      - Trace cache, microcode ROM, caches
      - One logical processor at a time per structure
  - Thread tags per trace cache entry
  - Microcode ROM: 2 microcode instruction pointers
  - Wider pipeline latches to hold state for 2 contexts
  - Branch prediction
      - RAS and branch history buffer duplicated
      - Global history shared, but tagged with logical processor ID

Trace Cache Hit

Trace Cache Miss

Hyper-threaded Execution

Execution Modes
  - Single-task (ST), Multi-task (MT)
  - ST0, ST1
  - HALT: transitions ST modes depending on logical processor executing
  - Interrupt sent to halted processor transitions to MT

HT Performance - OLTP

HT Performance - Web Server
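
The transitions listed under Execution Modes can be sketched as a small state machine. This is a toy model in plain Python, not a description of the hardware: the class name `PhysicalProcessor`, the methods `halt` and `interrupt`, the choice to start in MT mode, and the `ALL_HALTED` label for the case where both logical processors have halted are all assumptions made for illustration and do not appear on the slides.

```python
class PhysicalProcessor:
    """Toy model of the ST0 / ST1 / MT execution modes of one Hyper-Threaded core."""

    def __init__(self):
        # Assumption: both logical processors start active (multi-task mode).
        self.active = {0, 1}

    @property
    def mode(self):
        if self.active == {0, 1}:
            return "MT"          # both logical processors running
        if self.active == {0}:
            return "ST0"         # only logical processor 0 running
        if self.active == {1}:
            return "ST1"         # only logical processor 1 running
        return "ALL_HALTED"      # hypothetical label, not named on the slides

    def _check(self, lp):
        if lp not in (0, 1):
            raise ValueError("logical processor must be 0 or 1")

    def halt(self, lp):
        """Logical processor `lp` executes HALT; the other one keeps running."""
        self._check(lp)
        self.active.discard(lp)
        return self.mode

    def interrupt(self, lp):
        """An interrupt is delivered to logical processor `lp`, waking it if halted."""
        self._check(lp)
        self.active.add(lp)
        return self.mode


cpu = PhysicalProcessor()
trace = [
    cpu.mode,          # start
    cpu.halt(1),       # LP1 halts, LP0 runs alone
    cpu.interrupt(1),  # interrupt wakes LP1
    cpu.halt(0),       # LP0 halts, LP1 runs alone
    cpu.interrupt(0),  # interrupt wakes LP0
]
print(trace)  # ['MT', 'ST0', 'MT', 'ST1', 'MT']
```

Tracking the set of active logical processors, rather than hard-coding a transition table, keeps the mode name a pure function of which contexts are running, which matches the slide's point that the ST mode entered depends on which logical processor executed HALT.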