Real Processor Architectures Now that we’ve seen the basic design elements for modern processors,...


Page 1: Real Processor Architectures
• Now that we’ve seen the basic design elements for modern processors, we will take a look at several specific processors
  – We start with the 486 pipeline to see how NOT to do a pipeline
    • recall that Intel x86 is a CISC with variable-length instructions, memory-register addressing, some complex addressing modes and some complex instructions
    • we will compare it to the much more efficient MIPS pipeline
  – We then consider dynamic-issue superscalars of varying degrees of sophistication
  – To understand the Pentium architecture, we must look at how its designers avoided the pitfalls of the 486 by issuing micro-code rather than machine instructions; this requires that we also look at microprogrammed control units and micro-code

Page 2: 486 Processor
• The instruction set was almost identical to the 386 (and thus was still CISC based)
  – They added a floating point functional unit to the processor so that it could execute the floating point operations introduced for the x86 math coprocessor
  – This FP functional unit provided a degree of parallel processing: while FP operations were executed, the pipeline would continue to fetch and execute integer operations
  – It contained an 8KB combined instruction/data cache (later expanded to 16KB)
• The big difference between the 386 and 486, though, was the pipeline; the 486 was the first Intel processor with a pipeline
  – However, because of the CISC nature of the instruction set, the pipeline is not particularly efficient

Page 3: The 486 Pipeline
• They used a 5 stage pipeline
  – Fetch: 16 bytes’ worth of instruction
    • this may be 1 instruction (or even a part of 1 instruction), or multiple instructions
  – Decode stage 1: was an entire instruction fetched? If not, this stage stalls
    • divide up the 16 bytes into instruction(s)
  – Decode stage 2: decode the next instruction, fetch operands from registers
  – Execution: ALU operations, branch operations, cache (mov) instructions
    • this stage may take multiple cycles if an ALU operation requires 1 or more memory accesses (e.g., add x, 5 takes 2 memory accesses)
  – Write back: write the result of a load or ALU operation to a register

Page 4: 486 Difficulties
• Stalls arise for numerous reasons
  – Instructions that extend past the 16 bytes fetched (e.g., 17 bytes long) require 2 instruction fetch stages
  – Any memory-register or memory-immediate ALU operation takes at least 1 additional cycle, possibly 2 if the memory item is both a source and the destination
    • such a situation stalls instructions in the decode 2 stage
    • or in the EX stage if the result is written back to memory
  – Complex addressing modes can cause stalls
    • pointer accessing (indirect addressing) is available, which takes 2 memory accesses
    • scaled addressing mode can involve both an add and a shift
    • again, stalls occur in the decode 2 stage
  – Branch instructions have a 3 cycle penalty because branches are computed in the EX stage (4th stage); some loop operations take more than 1 cycle to compute, adding a further stall
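To get a rough feel for how these stalls add up, the penalties can be folded into an effective CPI. The sketch below is only an illustration: the 3-cycle branch penalty and the 1- or 2-cycle memory-operand penalties come from the slide, while the instruction-mix fractions and the name effective_cpi are assumptions made up for this example, not measured 486 data.

```python
def effective_cpi(base_cpi, events):
    """Return base CPI plus the sum of frequency * stall cycles.

    events: iterable of (fraction_of_instructions, stall_cycles) pairs.
    """
    return base_cpi + sum(freq * stalls for freq, stalls in events)


# Penalties are taken from the slide; the frequencies are hypothetical.
assumed_mix = [
    (0.15, 3),  # branches: 3-cycle penalty (resolved in EX, the 4th stage)
    (0.20, 1),  # ALU op with one memory operand: 1 extra cycle
    (0.05, 2),  # ALU op whose memory operand is both source and destination
]

cpi = effective_cpi(1.0, assumed_mix)
print(f"effective CPI = {cpi:.2f}")
```

Changing the assumed fractions shows how quickly memory-operand and branch stalls pull the pipeline away from the ideal CPI of 1.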

Page 5: 486 Examples
• The first example has three data movements with no penalties
• The second example has a data hazard requiring 1 cycle of stall
• The third example illustrates a branch penalty

Page 6: 486 Overall Architecture

Page 7: ARM Cortex-A8 (A-8) Processor
• Dual-issue superscalar with static scheduling but dynamic issue detection through a scoreboard
  – Up to 2 instructions per cycle
• 14-stage pipeline (see next slide)
  – Branch prediction is performed in the AGU (address generation unit) using:
    • Dynamic branch prediction with a 512-entry two-way set associative branch target buffer
    • 4K global history buffer
      – when the branch target buffer misses, a prediction is obtained from the global history buffer
    • 8-entry return stack
  – an incorrect branch prediction flushes the entire pipeline
  – Instruction decode is 5 stages long and up to 2 instructions are decoded per cycle
    • 8 bytes fetched from cache
    • if neither instruction is a branch, PC is incremented
    • stage 4 in this 5 stage mini-pipeline is the scoreboard and issue logic
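The lookup order described for branch prediction (branch target buffer first, global history buffer on a miss) can be sketched as a toy model. This is not the A-8's actual hardware design: the class name, the 2-bit saturating counters, the XOR indexing and the unbounded dictionary standing in for the 512-entry set associative buffer are all simplifying assumptions.

```python
class ToyBranchPredictor:
    """Toy model: a branch target buffer backed by a global-history table."""

    def __init__(self, ghb_entries=4096, history_bits=12):
        self.btb = {}                    # pc -> (last outcome, target); capacity not modeled
        self.ghb = [1] * ghb_entries     # 2-bit counters, initially weakly not-taken
        self.history = 0                 # recent branch outcomes, newest in bit 0
        self.history_mask = (1 << history_bits) - 1

    def _ghb_index(self, pc):
        # Assumed indexing scheme: combine the PC with the global history.
        return (pc ^ self.history) % len(self.ghb)

    def predict(self, pc):
        if pc in self.btb:               # BTB hit: use the stored outcome
            taken, _target = self.btb[pc]
            return taken
        # BTB miss: fall back to the global history buffer
        return self.ghb[self._ghb_index(pc)] >= 2

    def update(self, pc, taken, target):
        i = self._ghb_index(pc)
        if taken:
            self.ghb[i] = min(3, self.ghb[i] + 1)
        else:
            self.ghb[i] = max(0, self.ghb[i] - 1)
        self.btb[pc] = (taken, target)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = ToyBranchPredictor()
first = predictor.predict(0x100)        # BTB miss, so the history table decides
predictor.update(0x100, True, 0x80)     # the branch was actually taken
second = predictor.predict(0x100)       # BTB hit on the second encounter
print(first, second)
```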

Page 8: A-8 Pipeline

Page 9: A-8 Execution Unit
Either instruction can go to the load/store pipeline, but not both
ALU pipe 1 is for simple integer operations
Multiplies use ALU pipe 0 and can accommodate up to 2 in one cycle
Structural hazards are rare because the compiler attempts to schedule pairs of instructions so that they do not use the same instruction pipe at the same time
Data hazards are detected during decode by the scoreboard and may stall either both instructions or just the second of the pair; the compiler is responsible for attempting to prevent such stalls (note that forwarding is only available from WB (E5) to E0)

Page 10: A-8 Performance
The ideal CPI for the A-8 is 0.5 (2 instructions issued per cycle)
Here, you see that the ideal is not achieved and that, aside from the mcf and gzip benchmarks, the greatest source of stalls is the pipeline itself stalling (not cache misses)

Page 11: Pentium Architecture
• Recall our examination of the Intel 486 pipeline
  – variable length of instructions, variable complexity of operations, memory-register ALU operations, etc., led to poor performance
• In order to improve performance using RISC features, the Pentium architects had to rethink things
  – they were stuck with their CISC instruction set (for backward compatibility)
  – in CISC architectures, a machine instruction is first translated into a sequence of microinstructions
  – each microinstruction is a lengthy string of 1s and 0s, each of which refers to one control signal in the machine
  – there needs to be a process to translate each machine instruction into microinstructions and execute each microinstruction; this is done by collecting machine instructions and their associated microinstructions into microprograms

Page 12: Why Microinstructions?
• The Pentium architecture uses a microprogrammed control unit
  – there is already a necessary step of decoding a machine instruction into microcode
• Now, consider each microinstruction:
  – equal length
  – executes in the same amount of time (unless hazards arise)
  – branches are at the microinstruction level and are more predictable than machine language level branching
• In a RISC architecture, the simplicity of each instruction allows it to be carried out directly in hardware in 1 cycle (usually)
  – Intel realized that to efficiently pipeline their CISC architecture, they had to pipeline the microinstructions instead of machine instructions

Page 13: Control and Micro-Operations
• An example architecture is shown to the right
• Each of the various connections is controlled by a particular control signal
  – MBR to the AC is controlled with signal C11
  – PC to MAR by C2
  – AC to ALU by C7
    • note that this figure is incomplete
• A microprogram is a sequence of micro-operations
Note: this is not an x86 architecture!

Page 14: Example
• Consider a CISC instruction such as Add R1, X
  – X is copied into the MAR and a memory read is signaled
  – the datum is returned across the data bus to the MBR
  – the adder is sent the values in R1 and MBR, adds the two, and stores the result back into R1
• This sequence can be written in terms of micro-operations as:
  – t1: MAR ← (IR(address))
  – t2: MBR ← Memory
  – t3: R1 ← (R1) + (MBR)
  – or, if the sum is routed through the accumulator: t3: Acc ← (R1) + (MBR), then t4: R1 ← (Acc)
• Each micro-operation is handled by one or more control signals
  – For instance, MBR ← Memory is C5
t1, t2, ... are clock cycles; each microinstruction executes in a separate clock cycle
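The micro-operation sequence for Add R1, X can be mimicked with a few register transfers. The following is a minimal sketch, not a model of any real control unit: the dictionary-based registers, the IR_address field name and the sample values are assumptions chosen for illustration, and it uses the three-step form in which the sum goes directly into R1.

```python
def add_r1_x(regs, memory):
    """Run 'Add R1, X' as three micro-operations, one per clock cycle."""
    trace = []

    # t1: MAR <- (IR(address))
    regs["MAR"] = regs["IR_address"]
    trace.append("t1: MAR <- (IR(address))")

    # t2: MBR <- Memory
    regs["MBR"] = memory[regs["MAR"]]
    trace.append("t2: MBR <- Memory")

    # t3: R1 <- (R1) + (MBR)
    regs["R1"] = regs["R1"] + regs["MBR"]
    trace.append("t3: R1 <- (R1) + (MBR)")

    return trace


# Hypothetical machine state: X lives at address 0x20 and holds 5.
regs = {"IR_address": 0x20, "MAR": 0, "MBR": 0, "R1": 7}
memory = {0x20: 5}

for step in add_r1_x(regs, memory):
    print(step)
print("R1 =", regs["R1"])
```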

Page 15: Control Memory
Control memory layout (from the figure, top to bottom):
  Fetch cycle routine: ... ; Jump to Indirect or Execute
  Indirect cycle routine: ... ; Jump to Execute
  Interrupt cycle routine: ... ; Jump to Fetch
  Execute cycle begin: Jump to Op code routine
  AND routine: ... ; Jump to Fetch or Interrupt
  ADD routine: ... ; Jump to Fetch or Interrupt
  ...
Each micro-program consists of one or more micro-instructions, each stored in a separate entry of the control memory
The control memory itself is firmware, a program stored in ROM, that is placed inside of the control unit
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program

Page 16: Example of Three Micro-Programs
• Fetch:
    t1: MAR ← (PC)                          C2
    t2: MBR ← Memory                        C0, C5, CR
        PC ← (PC) + 1                       C*
    t3: IR ← (MBR)                          C4
• Indirect:
    t1: MAR ← (IR(address))                 C8
    t2: MBR ← Memory                        C0, C5, CR
    t3: IR(address) ← (MBR(address))        C4
• Interrupt:
    t1: MBR ← (PC)                          C1
    t2: MAR ← save address                  C*
        PC ← routine address                C*
    t3: Memory ← (MBR)                      C12, CW
  – CR: read control to system bus
  – CW: write control to system bus
• C0 – C12 refer to the previous figure
• C* are signals not shown in the figure

Page 17: Horizontal vs. Vertical Micro-Instructions
Horizontal micro-instruction format (figure): Internal CPU Control Signals | System Bus Control Signals | Jump Condition | Micro-instruction Address
Vertical micro-instruction format (figure): Function Codes | Jump Condition | Micro-instruction Address
Horizontal micro-instructions contain 1 bit for every control signal controlled by the control unit
Vertical micro-instructions use function codes that need additional decoding
Because the horizontal micro-instruction requires 1 bit for every control line, it is longer than the vertical micro-instruction and therefore takes more space to store, but it does not require additional decoding time in the control unit
The micro-instruction address gives a branch target in the control memory; the branch is taken if the jump condition is true
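The space-versus-decoding trade-off between the two formats can be made concrete with a small sketch. The signal names are the ones used in the Fetch micro-program on the earlier slide; the five-signal subset, the function-code table and the helper names are hypothetical and only serve to show the idea.

```python
SIGNALS = ["C0", "C2", "C4", "C5", "CR"]   # a subset of the slide's control signals


def encode_horizontal(active):
    """One bit per control signal: bit i is set when SIGNALS[i] is asserted."""
    word = 0
    for i, name in enumerate(SIGNALS):
        if name in active:
            word |= 1 << i
    return word


def decode_horizontal(word):
    """No table lookup needed: each bit drives its control line directly."""
    return {name for i, name in enumerate(SIGNALS) if word & (1 << i)}


# Vertical format: a short function code that a decoder must expand.
# This table is an assumption made up for the example.
FUNCTION_CODES = {
    0: set(),
    1: {"C2"},                # MAR <- (PC)
    2: {"C0", "C5", "CR"},    # MBR <- Memory
    3: {"C4"},                # IR <- (MBR)
}


def decode_vertical(code):
    """Extra decoding step: look the function code up in a table."""
    return FUNCTION_CODES[code]


step = {"C0", "C5", "CR"}
word = encode_horizontal(step)
print(f"horizontal: {len(SIGNALS)} bits, word = {word:0{len(SIGNALS)}b}")
print(f"vertical:   {max(FUNCTION_CODES).bit_length()} bits, code = 2")
assert decode_horizontal(word) == decode_vertical(2)
```

With only five signals the saving is small, but the horizontal word grows by one bit per control line, whereas the vertical code grows only with the logarithm of the number of distinct functions.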

Page 18: Microprogrammed Control Unit

Page 19: Continued
• Decoder analyzes IR
  – delivers the starting address of the op code’s micro-program in the control store
    • the address is placed in a micro-program counter (here, called a Control Address Register)
• Loop on the following
  – sequencer signals a read of control memory using the address in the microPC
  – item in control memory is moved to the control buffer register
  – contents of the control buffer register generate control signals and next address information
    • if the micro-instructions are vertical, decoding is required here
  – sequencer moves the next address to the control address register
    • next instruction (add 1 to current)
    • jump to new part of this microprogram
    • jump to new machine routine
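The sequencing loop can be sketched in a few lines. This is a simplified illustration under stated assumptions: the control-store addresses, the one-step ADD and AND routines, and the three next-address kinds ("next", "dispatch", "jump") are invented for the example, and interrupts and indirect cycles are left out.

```python
# Each control-store entry: (control_signals, next_kind, jump_target).
#   "next"     -> add 1 to the control address register
#   "dispatch" -> decoder supplies the start address of the op code's routine
#   "jump"     -> go to jump_target
CONTROL_STORE = {
    0: ({"C2"}, "next", None),               # Fetch t1: MAR <- (PC)
    1: ({"C0", "C5", "CR"}, "next", None),   # Fetch t2: MBR <- Memory
    2: ({"C4"}, "dispatch", None),           # Fetch t3: IR <- (MBR)
    10: ({"ADD_STEP"}, "jump", 0),           # hypothetical one-step ADD routine
    20: ({"AND_STEP"}, "jump", 0),           # hypothetical one-step AND routine
}
OPCODE_START = {"ADD": 10, "AND": 20}        # decoder: op code -> routine address


def run(opcodes):
    """Return the control-store addresses visited while executing opcodes."""
    pending = list(opcodes)
    visited = []
    car = 0                                  # control address register (micro-PC)
    while True:
        # Read the entry into the control buffer register; the signals
        # would be driven onto the control lines at this point.
        signals, kind, target = CONTROL_STORE[car]
        visited.append(car)
        if kind == "next":
            car += 1
        elif kind == "dispatch":
            car = OPCODE_START[pending.pop(0)]
        else:                                # "jump"
            if not pending:                  # nothing left to fetch: stop the sketch
                return visited
            car = target


print(run(["ADD", "AND"]))
```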

Page 20: Pentium IV: RISC features
• All RISC features are implemented at the microinstruction level instead of the machine instruction level as seen in typical RISC processors
  – Microinstruction-level pipeline
  – Dynamically scheduled micro-operations
  – Reservation stations (128) and multiple functional units (7)
  – Branch speculation via branch target buffer
    • speculation at micro-instruction level, not machine level
    • instead of an ROB, decisions are made at the reservation stations: a misspeculation causes reservation stations to flush their contents, while a correct speculation causes reservation stations to forward results to registers/store units
  – Trace cache used (discussed shortly)

Page 21: Pentium Pipeline
• Fetch machine instruction (3 stages)
• Decode machine instruction into microinstructions (2 stages)
• Superscalar issue of multiple microinstructions (2 stages)
  – register renaming occurs here; up to 3 microinstructions can be issued per cycle (2 integer and 1 FP)
• Execution of microinstructions (1 stage)
  – Functional units are pipelined and can take from 1 up to approximately 32 cycles to execute
• Write back (3 stages)
• Commit (3 stages)
  – up to 3 microinstructions can commit in any cycle

Page 22: Pentium IV Overall Architecture

Page 23: Specifications
• 7 functional units:
  – 2 simple ALUs (add, compare, shift): ½ cycle execution to accommodate up to 2 micro-operations per cycle
  – 1 complex ALU (integer multiplication and division): multicycle, pipelined
  – 1 load unit and 1 store unit, including address computation
  – 1 FP move (register to register move and convert)
  – 1 FP arithmetic unit (+, -, *, /): multicycle, pipelined, some SIMD execution permitted on these units
• 128 registers for renaming
  – reservation stations are used rather than a re-order buffer
  – instructions must wait in reservation stations longer than in Tomasulo’s version, waiting for speculation results

Page 24: Trace Cache
• The trace cache is an instruction cache
  – It caches not just individual instructions or even memory refill lines; it caches blocks of instructions that have recently been executed together
    • In this way, the trace cache encodes branch behavior implicitly
    • Additionally, misspeculated instructions would be discarded from a trace cache
• The trace cache was developed for the Pentium IV, so it stores microinstructions (not machine instructions)
• Combining a trace cache and a branch target buffer minimizes microinstruction fetch and decoding
  – As long as the microinstructions remain in the trace cache
  – Mispredictions at the microinstruction level are far rarer than mispredictions at the machine level

Page 25: Source of Stalls
• This architecture is very complex and relies on being able to fetch and decode instructions quickly
  – The process breaks down when
    • fewer than 3 instructions can be fetched in 1 cycle
    • the trace cache misses, or branches are mispredicted
    • fewer than 3 instructions can be issued, either because they are not 2 int + 1 FP or because of structural hazards
    • the number of reservation stations is a limitation
    • data dependencies between functional units cause stalls because other instructions have to wait at their reservation stations
    • a data cache access results in a miss

Page 26: Continued
• Stalls manifest themselves in two places
  – The issue stage
    • branch mispredictions
    • cache misses
    • reservation stations full
  – The commit stage
    • branch mispredictions
    • instructions not ready to commit yet
      – these are not actually stalls, but because instructions are committed in the order they were issued, a later instruction may wait to commit because earlier instructions are time consuming; if the later instruction is a branch, instructions may continue to be fetched improperly because of misspeculation
    • branch computation not yet available

Page 27: Continued
• Misprediction rates (at the micro-operation level) are very low
  – About 0.8% for integer benchmarks and 0.1% for floating point benchmarks
    • notice how FP benchmarks continue to have high predictability because they involve a lot of for loops, which are very predictable; integer benchmarks tend to have more conditional statements, which are less predictable
  – At the machine language level, misspeculation is between 0.1% and 1.5%
• Trace cache has nearly a 0% miss rate
  – The L1 and L2 data caches have miss rates of around 6% and 0.5% respectively
  – The machine’s effective CPI ranges from around 1.2 to 5.85 with an average of around 2.2 (machine instructions, not micro-operations)

Page 28: Earlier Pentiums
• Pipeline changes:
  – Pentium pipeline: 2-issue superscalar, 5 stages
  – Pentium Pro pipeline: 14 stages
  – Pentium III pipeline: 14 stages (shown earlier in these slides)
  – Pentium IV pipeline: 21 stages (minimum), eventually lengthened to 31 (minimum)
• Bus widened to support 64 GB of memory
• Conditional instructions introduced (we will cover this next week)
• Faster clock cycles introduced
  – From 1 GHz to 1.5 GHz, eventually up to 3.2 GHz
    • the clock rate is so fast that it takes 2 complete cycles for an instruction or data to cross the chip
• Increased reservation stations
  – PIII: 40, PIV: 128
    • up to 128 instructions can be in some state of operation simultaneously

Page 29: Pentium IV versus AMD Opteron
• The Opteron uses dynamic scheduling, speculation, a shallower pipeline, issue and commit of up to 3 instructions per cycle, and a 2-level cache; the chip has a similar transistor count, although it runs at only 2.8 GHz

• The Opteron implements the x86 instruction set with AMD’s 64-bit extensions rather than a RISC instruction set; like the P4, it translates machine instructions into simpler internal operations
  – P4 has a higher CPI on all benchmarks except mcf
    • the AMD’s CPI is more than twice the P4’s on this benchmark
  – So in most cases, instructions take fewer cycles to complete (lower CPI) on the AMD than on the P4, but the P4 has a slightly faster clock to offset this
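The trade-off between a lower CPI and a faster clock follows from CPU time = instruction count × CPI / clock rate. The sketch below plugs in the 2.8 GHz Opteron clock, the 3.2 GHz P4 clock and the P4's average CPI of about 2.2 from these slides; the Opteron CPI of 1.8 and the instruction count are assumptions chosen only to illustrate the calculation.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU time = instruction count * cycles per instruction / clock rate."""
    return instruction_count * cpi / clock_hz


INSTRUCTIONS = 1_000_000_000     # hypothetical program length

p4_time = cpu_time_seconds(INSTRUCTIONS, cpi=2.2, clock_hz=3.2e9)
opteron_time = cpu_time_seconds(INSTRUCTIONS, cpi=1.8, clock_hz=2.8e9)  # CPI assumed

print(f"P4:      {p4_time:.4f} s")
print(f"Opteron: {opteron_time:.4f} s")
```

Under these assumed numbers the lower CPI more than makes up for the slower clock; a different assumed CPI could tip the comparison the other way.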

Page 30: Intel Core i7
• The i7 extends the Pentium approach
  – Aggressive out-of-order speculation
  – Deep pipeline (14 stages)
    • instruction fetch retrieves 16 bytes to decode
    • there is a separate IIFU that feeds a queue that can store up to 18 instructions at a time
  – unlike the Pentium, decoding includes a step called macro-op fusion, which combines certain adjacent instructions (such as a compare followed by a conditional branch) into a single operation
    • if a loop is detected that contains fewer than 28 instructions or 256 bytes, these instructions remain in a buffer to be issued repeatedly (rather than being fetched repeatedly)

Page 31: Continued
• Instruction fetch also includes
  – The use of a multilevel branch target buffer and a return address stack for speculation
    • mispredictions cause a penalty of about 15 cycles
  – A 32 KB instruction cache
• Decoding first converts machine instructions into microcode and breaks instructions into two types using four decoders
  – Simple micro-operation instructions (2 each)
  – Complex micro-operation instructions (2 each)
• Instruction issue can issue up to 6 micro-operations per cycle to
  – 36 centralized reservation stations
  – 6 functional units, including 1 load and 2 store units that share a memory buffer connected to 3 different data caches

Page 32: i7 Architecture

Page 33: i7 Performance
CPI for various SPEC06 benchmarks (figure)
Average CPI is 1.06 for integer programs and 0.89 for FP
This is the number of machine instructions issued (not micro-ops), so obtaining the values is not completely transparent
The Pentium and i7 are both susceptible to misspeculation, which results in “wasted” work; up to 40% of the total work that goes into SPEC06 benchmarks is wasted
Waste also arises from cache misses (10 cycles or more lost with an L1 miss, 30-40 for L2 misses and as much as 135 for L3 misses)

Page 34: Multicore Processors
• With additional space on the chip, the strategy today is to equip the processor with multiple cores
  – Each core is a separate processor with its own local cache, local bus, etc.
  – An additional cache is commonly added to the chip so that there is an L1 (within each core), L2 (on the chip, shared among cores) and L3 (off chip)
• We will briefly consider multicore processors later when we consider thread-level processing and true parallel processing
• We wrap up our examination of processors by looking at multicore performance as the number of cores increases

Page 35: Multicore Performance
• Three things are apparent when considering the performance of the multi-core processors
  – First, obviously, the IBM Power7 outperforms the other two in every case
  – The speedup is close to, but not always, linear in the number of cores
    • doubling the number of cores does not guarantee twice the performance
  – There is a greater potential for speedup on FP benchmarks for the Power7 than on the int benchmarks
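One common way to reason about why speedup falls short of linear is Amdahl's law, which the slides do not mention by name; it is used here only as an illustrative model. The parallel fractions below are assumptions, not measurements from these benchmarks.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over 1 core when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for cores in (1, 2, 4, 8):
    mostly_parallel = amdahl_speedup(0.95, cores)   # assumed 95% parallel
    less_parallel = amdahl_speedup(0.80, cores)     # assumed 80% parallel
    print(f"{cores} cores: {mostly_parallel:.2f}x vs {less_parallel:.2f}x")
```

In this model, a workload with a larger parallel fraction stays closer to linear speedup as cores are added, which is one possible reading of the difference between the FP and integer results.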

Page 36: A Balancing Act
• Improving one aspect of our processor does not necessarily improve performance
  – in fact, it might harm performance
    • consider lengthening the pipeline depth and increasing clock speed in the P4 without adding reservation stations or using the trace cache
    • stalls will arise at the issue stage, thus defeating any benefit from the longer pipeline
    • cache misses will have a greater impact, not a lesser impact, when the clock speed is increased
• Modern processor design takes a lot of effort to balance out the factors
  – without accurate branch prediction and speculation hardware, stalls from mispredicted branches will drop performance greatly
    • we saw this in both the ARM and i7 processors
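The point about cache misses hurting more at higher clock speeds follows from memory latency being roughly fixed in nanoseconds, so a faster clock turns the same delay into more lost cycles. The sketch below uses the 1 GHz, 1.5 GHz and 3.2 GHz clock rates mentioned in these slides; the 50 ns latency, the base CPI of 1.0 and the miss rate of 0.01 per instruction are assumptions for illustration.

```python
def miss_penalty_cycles(memory_latency_ns, clock_ghz):
    """Cycles lost per miss when memory latency is fixed in nanoseconds."""
    return memory_latency_ns * clock_ghz


def cpi_with_misses(base_cpi, misses_per_instruction, memory_latency_ns, clock_ghz):
    """Base CPI plus the average miss stall cycles per instruction."""
    penalty = miss_penalty_cycles(memory_latency_ns, clock_ghz)
    return base_cpi + misses_per_instruction * penalty


for ghz in (1.0, 1.5, 3.2):
    cpi = cpi_with_misses(1.0, 0.01, 50.0, ghz)   # latency and miss rate assumed
    print(f"{ghz} GHz: {miss_penalty_cycles(50.0, ghz):.0f} cycles per miss, CPI = {cpi:.2f}")
```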

Page 37: Continued
• As clock speeds increase
  – Stalls from cache misses create a bigger impact on CPI, so larger caches and cache optimization techniques are needed
• To support multiple issue of instructions
  – we need a larger cache-to-processor bandwidth, which can take up valuable space
• As we increase the number of instructions that can be issued
  – we need to increase the number of reservation stations and reorder buffer size
• Some compiler optimizations can also be applied to help support the distributed nature of the hardware (we look at this next week)