AMD Opteron Overview Michael Trotter (mjt5v) Tim Kang (tjk2n) Jeff Barbieri (jjb3v)

Page 1

AMD Opteron Overview

Michael Trotter (mjt5v), Tim Kang (tjk2n)

Jeff Barbieri (jjb3v)

Page 2

Introduction

• AMD Opteron
  – Focuses on Barcelona
• Barcelona is AMD’s 65nm 4-core CPU

Page 3

Fetch

• Fetches 32B from the L1 cache into the pre-decode/Pick buffer
• For simplicity, Barcelona uses pre-decode information to mark the end of an instruction

Page 4

Inst. Decode

• The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle
  – Inserts pre-decode information into the ECC bits of the L1I, L2 and L3 caches, along with each line of instructions
• Instructions are then passed through the sideband stack optimizer
  – x86 includes instructions to directly manipulate the stack of each thread
  – AMD introduced a sideband stack optimizer to remove these stack manipulations from the instruction stream
  – Thus, many stack operations can be processed in parallel
• Frees up the reservation stations, re-order buffers, and regular ALUs for other work
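What the sideband stack optimizer accomplishes can be sketched in a few lines of Python. This is a toy illustration, not AMD's hardware: the idea is that successive pushes and pops normally serialize through the stack pointer (each one adjusts rsp), but a dedicated adder can resolve the rsp deltas up front, turning the stack operations into independent memory accesses at fixed offsets.

```python
# Toy model: resolve stack-pointer deltas ahead of time so that
# push/pop micro-ops no longer depend on one another through rsp.
# The 8-byte slot size assumes 64-bit operands.

def resolve_stack_ops(ops):
    """ops: list of ('push', reg) / ('pop', reg) in program order.
    Returns memory ops with pre-computed rsp-relative offsets."""
    delta = 0
    resolved = []
    for op, reg in ops:
        if op == 'push':
            delta -= 8                       # rsp moves down before the store
            resolved.append(('store', reg, delta))
        elif op == 'pop':
            resolved.append(('load', reg, delta))
            delta += 8                       # rsp moves up after the load
    return resolved

print(resolve_stack_ops([('push', 'rax'), ('push', 'rbx'), ('pop', 'rbx')]))
# [('store', 'rax', -8), ('store', 'rbx', -16), ('load', 'rbx', -16)]
```

Because every resolved operation carries an absolute offset, the two stores above no longer depend on each other and can execute in parallel, which is why the optimizer frees up the regular ALUs.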

Page 5

Branch Prediction

• A branch selector chooses between a bi-modal predictor and a global predictor
  – The bi-modal predictor and branch selector are both stored in the ECC bits of the instruction cache, as pre-decode information
  – The global predictor combines the relative instruction pointer (RIP) for a conditional branch with a global history register
    • Tracks the last 12 branches with a 16K-entry prediction table containing 2-bit saturating counters
  – The branch target address calculator (BTAC) checks the targets for relative branches
    • Can correct mis-predictions with a two-cycle penalty
• Barcelona uses an indirect predictor
  – Specifically designed to handle branches with multiple targets (e.g. switch or case statements)
• The return address stack has 24 entries
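The global predictor described above can be sketched as a small Python model. Only the table size (16K entries), counter width (2 bits), and history length (12 branches) come from the slides; the index function (RIP XOR history) and the initial counter value are illustrative assumptions, not AMD's documented design.

```python
# Sketch of a global branch predictor: a 16K-entry table of 2-bit
# saturating counters indexed by combining the branch RIP with a
# 12-bit global history register. Index function is an assumption.

TABLE_SIZE = 16 * 1024   # 16K entries
HISTORY_BITS = 12        # tracks the last 12 branches

class GlobalPredictor:
    def __init__(self):
        self.table = [1] * TABLE_SIZE  # counters start weakly not-taken
        self.history = 0               # global history register

    def _index(self, rip):
        return (rip ^ self.history) % TABLE_SIZE

    def predict(self, rip):
        return self.table[self._index(rip)] >= 2  # True = predict taken

    def update(self, rip, taken):
        i = self._index(rip)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0
        # shift the outcome into the 12-bit history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)

p = GlobalPredictor()
for _ in range(20):          # train on a branch that is always taken
    p.update(0x401000, True)
print(p.predict(0x401000))   # True: the trained counter predicts taken
```

Saturating counters give the predictor hysteresis: a single anomalous outcome nudges the counter one step rather than flipping the prediction outright.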

Page 6

Pipeline

• Uses a 12 stage pipeline

Page 7

OO (ROB)

• The Pack Buffer (post-decoding buffer) sends groups of 3 micro-ops to the re-order buffer (ROB)
  – The re-order buffer contains 24 entries, with 3 lanes per entry
    • Holds a total of 72 instructions
  – Instructions can be moved between lanes to avoid a congested reservation station or to observe issue restrictions
• From the ROB, instructions issue to the appropriate scheduler
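The dispatch grouping described above can be modeled as a small ring buffer. This is a toy model: the 24 entries, 3 lanes, and 72-instruction capacity come from the slides, while the class and method names and the one-row-at-a-time retirement policy are illustrative assumptions.

```python
# Toy ROB: micro-ops arrive in groups of up to 3 and occupy one
# 3-lane entry; with 24 entries, at most 72 micro-ops are in flight.
from collections import deque

ENTRIES, LANES = 24, 3

class ReorderBuffer:
    def __init__(self):
        self.rows = deque()               # each row = one dispatch group

    def dispatch(self, group):
        assert len(group) <= LANES
        if len(self.rows) == ENTRIES:
            return False                  # ROB full: dispatch stalls
        self.rows.append(list(group))
        return True

    def retire(self):
        # retire in program order, one full row at a time
        return self.rows.popleft() if self.rows else None

    def occupancy(self):
        return sum(len(r) for r in self.rows)

rob = ReorderBuffer()
for i in range(24):
    rob.dispatch([f'uop{3*i}', f'uop{3*i+1}', f'uop{3*i+2}'])
print(rob.occupancy())                    # 72 micro-ops in flight
print(rob.dispatch(['extra']))            # False: the 25th group stalls
```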

Page 8

ROB

Page 9

Integer Future File and Register File (IFFRF)

• The IFFRF contains 40 registers broken up into three distinct sets
  – The Architectural Register File
    • Contains 16 x 64-bit non-speculative registers
    • Instructions may modify the Architectural Register File only once they are committed
  – Speculative instructions read from and write to the Future File
    • Contains the most recent speculative state of the 16 architectural registers
  – The last 8 registers are scratchpad registers used by the microcode
• Should a branch mis-prediction or an exception occur, the pipeline rolls back, and the Architectural Register File overwrites the contents of the Future File
• There are three reservation stations, i.e. schedulers, within the integer cluster
  – Each station is tied to a specific lane in the ROB and holds 8 instructions
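The rollback mechanism described above can be illustrated with a toy model: speculative writes land in the future file, retirement copies them into the architectural file, and a mis-prediction restores the future file from the architectural file. The function names and the single-value commit are illustrative assumptions; only the 16-register split and the overwrite-on-rollback behavior come from the slides.

```python
# Toy future-file/architectural-file model of speculative state.
NUM_ARCH_REGS = 16

arch_file = [0] * NUM_ARCH_REGS       # committed (non-speculative) state
future_file = arch_file.copy()        # most recent speculative state

def speculative_write(reg, value):
    future_file[reg] = value          # speculation touches the future file only

def commit(reg):
    arch_file[reg] = future_file[reg] # retirement makes the value architectural

def rollback():
    future_file[:] = arch_file        # flush speculation on mis-predict/exception

speculative_write(3, 42)
commit(3)                             # this write retires
speculative_write(5, 99)              # this one was on the wrong path
rollback()
print(arch_file[3], future_file[5])   # 42 0
```

The appeal of this scheme is that readers always consult the future file for the newest value, and recovery is a bulk copy rather than a walk of in-flight instructions.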

Page 10

Integer Execution

• Barcelona uses three symmetric ALUs which can execute almost any integer instruction

• Three full-featured ALUs require more die area and power
  – Can provide higher performance for certain edge cases
  – Enables a simpler design for the ROB and schedulers

Page 11

Floating Point Execution

• Floating Point operations are first sent to the FP Mapper and Renamer

• In the Renamer, up to 3 FP instructions each cycle are assigned a destination register from the 120 FP register file entries.

• Once the micro-ops have been renamed, they may be issued to the three FP schedulers

• Operands can be obtained from either the FP register file or the forwarding network

Page 12

Floating Point Execution (SIMD)

• The FPUs are 128 bits wide so that Streaming SIMD Extension (SSE) instructions can execute in a single pass.

• Similarly, the load-store units and the FMISC unit load 128-bit wide data to improve SSE performance.

Page 13

Memory Overview

Page 14

Memory Hierarchy

• 4 separate 128KB 2-way set associative L1 caches
  – Latency = 3 cycles
  – Write-back to L2
  – The data paths into and out of the L1D cache were also widened to 256 bits (128 bits transmit and 128 bits receive)
• 4 separate 512KB 16-way set associative L2 caches
  – Latency = 12 cycles
  – Line size is 64B
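The capacities and associativities above fix the number of sets in each cache (sets = capacity / (ways x line size)). A quick check in Python, assuming 64B lines throughout, as the slides state for the L2:

```python
# Derive set counts from the cache parameters on this slide.
def cache_sets(capacity_bytes, ways, line_bytes=64):
    # sets = capacity / (associativity * line size)
    return capacity_bytes // (ways * line_bytes)

print(cache_sets(128 * 1024, 2))    # L1: 1024 sets
print(cache_sets(512 * 1024, 16))   # L2: 512 sets
```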

Page 15

L3 Cache

• Shared 2MB 32-way set associative L3
  – Latency = 38 cycles
  – Uses 64B lines
  – The L3 cache was designed with data sharing in mind
    • When a line is requested, if it is likely to be shared, then it will remain in the L3
      – This leads to duplication which would not happen in an exclusive hierarchy
• In the past, a pseudo-LRU algorithm would evict the oldest line in the cache
  – In Barcelona’s L3, the replacement algorithm has been changed to prefer evicting unshared lines
• Access to the L3 must be arbitrated since the L3 is shared between four different cores
  – A round-robin algorithm is used to give access to one of the four cores each cycle
• Each core has 8 data prefetchers (a total of 32 per device)
  – Fill the L1D cache
  – Can have up to 2 outstanding fetches to any address
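The round-robin arbitration mentioned above can be sketched as follows. This is a minimal model under stated assumptions: each cycle at most one of the four cores' pending requests is granted, and the search position rotates past the winner so no core is starved. The function shape and starvation policy details are illustrative, not AMD's documented arbiter.

```python
# Minimal round-robin arbiter for four cores sharing the L3.
NUM_CORES = 4

def arbitrate(pending, start):
    """pending: set of core ids with an L3 request this cycle.
    Returns (granted core or None, next start position)."""
    for i in range(NUM_CORES):
        core = (start + i) % NUM_CORES
        if core in pending:
            return core, (core + 1) % NUM_CORES  # rotate past the winner
    return None, start

grants, start = [], 0
for _ in range(4):
    core, start = arbitrate({0, 2}, start)  # cores 0 and 2 keep requesting
    grants.append(core)
print(grants)  # [0, 2, 0, 2] -- the two requesters alternate fairly
```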

Page 16

Memory Controllers

• Each memory controller supports independent 64B transactions
• The integrated DDR2 memory controller ensures that an L3 cache miss is resolved in less than 60 nanoseconds

Page 17

TLB

• Barcelona offers non-speculative memory access re-ordering in the form of the Load Store Units (LSUs)
  – Thus, some memory operations can be issued out-of-order
• In the 12-entry LSU1, the oldest operations translate their addresses from the virtual address space to the physical address space using the L1 DTLB
• During this translation, the lower 12 bits of the load operation’s address are tested against previously stored addresses
  – If they are different, then the load proceeds ahead of the store
  – If they are the same, load-store forwarding occurs
• Should a miss in the L1 DTLB occur, the L2 DTLB will be checked
  – Once the load or store has located its address in the cache, the operation will move on to LSU2
• LSU2 holds up to 32 memory accesses, where they stay until they are removed
  – The LSU2 handles any cache or TLB misses via scheduling and probing
  – In the case of a cache miss, the LSU2 will look in the L2, L3 and then memory
  – In the case of TLB misses, it will look in the L2 TLB and then main memory
  – The LSU2 also holds store instructions, which are not allowed to actually modify the caches until retirement, to ensure correctness
  – Thus, the LSU2 contains the majority of the complexity in the memory pipeline
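The low-12-bit comparison described above works because the bottom 12 bits are the page offset, which virtual-to-physical translation leaves unchanged, so they can be compared before the full translation completes. A sketch of the check, with the function name and return convention as illustrative assumptions:

```python
# Compare the page-offset bits (low 12) of a young load against
# older, not-yet-retired stores to decide whether the load may
# safely proceed ahead of them.

PAGE_OFFSET_MASK = (1 << 12) - 1

def may_forward(load_addr, older_store_addrs):
    """Return the most recent older store whose low 12 bits match the
    load (a forwarding candidate), or None if the load may proceed
    ahead of all of them."""
    lo = load_addr & PAGE_OFFSET_MASK
    for store in reversed(older_store_addrs):
        if store & PAGE_OFFSET_MASK == lo:
            return store
    return None

print(hex(may_forward(0x70001234, [0x70000040, 0x70001234])))  # 0x70001234
print(may_forward(0x70001238, [0x70000040, 0x70001234]))       # None
```

Note that matching only 12 bits is conservative: two different pages can alias in their offsets, so a match forces forwarding or a stall even when the full addresses would differ.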

Page 18

HyperTransport

• Barcelona has four HyperTransport 3.0 lanes for inter-processor communications and I/O devices
• HyperTransport 3.0 adds a feature called ‘unganging’, or lane-splitting
• The HT 3.0 links are composed of two 16-bit lanes (in both directions)
  – Each can be split up into a pair of independent 8-bit wide links

Page 19

Shanghai

• The latest model of the Opteron series
• Several improvements over Barcelona
  – 45nm process
  – 6MB L3 cache
  – Improved clock speeds
  – A host of other improvements