Download - Review: Compiler techniques for parallelism Loop …kubitron/courses/cs152-F99/... · ° Key idea of Scoreboard: Allow instructions behind stall ... The Big Picture: Where are We

10/25/99 ©UCB Fall 1999 CS152 / KubiatowiczLec16.1

CS152Computer Architecture and Engineering

Lecture 16

Dynamic Scheduling (Cont), Speculation, and ILP

October 25, 1999

John Kubiatowicz (http.cs.berkeley.edu/~kubitron)

lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/


Review: Compiler techniques for parallelism

° Loop unrolling ⇒ Multiple iterations of loop insoftware:

• Amortizes loop overhead over several iterations• Gives more opportunity for scheduling around stalls

° Software Pipelining ⇒ Take one instruction from eachof several iterations of the loop

• Software overlapping of loop iterations• Today will show hardware overlapping of loop iterations

° Very Long Instruction Word machines (VLIW) ⇒Multiple operations coded in single, long instruction

• Requires sophisticated compiler to decide whichoperations can be done in parallel

• Trace scheduling ⇒ find common path and schedulecode as if branches didn’t exist (+ add “fixup code”)

° All of these require additional registers


Review: Dynamic hardware for out-of-order execution° HW exploitation of ILP

• Works when can’t know dependence at compile time.• Code for one machine runs well on another

° Key idea of Scoreboard: Allow instructions behind stallto proceed (Decode => Issue instr & read operands)

• Enables out-of-order execution => out-of-order completion

• ID stage checked both for structural & data dependencies

• Original version didn’t handle forwarding.

• No automatic register renaming⇒stalls for WAR and WAW hazards

• Are these fundamental limitations??? (No)


° The Five Classic Components of a Computer

° Today’s Topics:• Recap last lecture

• Hardware loop unrolling with Tomasulo algorithm

• Administrivia

• Speculation, branch prediction

• Reorder buffers

The Big Picture: Where are We Now?

Control

Datapath

Memory

Processor

Input

Output


Another Dynamic Algorithm: Tomasulo Algorithm

° For IBM 360/91 about 3 years after CDC 6600 (1966)

° Goal: High Performance without special compilers

° Differences between IBM 360 & CDC 6600 ISA• IBM has only 2 register specifiers/instr vs. 3 in CDC 6600

• IBM has 4 FP registers vs. 8 in CDC 6600

• IBM has memory-register ops

° Why Study? lead to Alpha 21264, HP 8000, MIPS 10000,Pentium II, PowerPC 604, …


Tomasulo Algorithm vs. Scoreboard

° Control & buffers distributed with Function Units (FU) vs.centralized in scoreboard;

• FU buffers called “reservation stations”; have pending operands

° Registers in instructions replaced by values or pointersto reservation stations(RS); called register renaming ;

• avoids WAR, WAW hazards

• More reservation stations than registers, so can do optimizationscompilers can’t

° Results to FU from RS, not through registers, overCommon Data Bus that broadcasts results to all FUs

° Load and Stores treated as FUs with RSs as well

° Integer instructions can go past branches, allowingFP ops beyond basic block in FP queue


Tomasulo Organization

��

��

��

��

��

��

��

��

�� !��

"��##��

��##��

"��"��"��"��$"��%"��&


Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Value of Source operands• Store buffers has V field, result to be stored

Qj, Qk: Reservation stations producing source registers(value to be written)• Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready• Store buffers only have Qi for RS producing result

Busy: Indicates reservation station or FU is busy

Register result status—Indicates which functional unitwill write each register, if one exists. Blank when nopending instructions that will write that register.


Three Stages of Tomasulo Algorithm

1. Issue—get instruction from FP Op Queue If reservation station free (no structural hazard),

control issues instr & sends operands (renames registers).

2. Execution—operate on operands (EX) When both operands ready then execute;

if not ready, watch Common Data Bus for result

3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting units;

mark reservation station available

° Normal data bus: data + destination (“go to” bus)

° Common data bus: data + source (“come from” bus)• 64 bits of data + 4 bits of Functional Unit source address

• Write if matches expected Functional Unit (produces result)

• Does the broadcast


Tomasulo Example

Instruction status: Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30

0 FU


Tomasulo Example Cycle 1


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




1 FU Load1



Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




2 FU Load2 Load1

��




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No


3 FU Mult1 Load2 Load1

� �� !�"#��$��%��

� "��&�� '��'��(��"��&)




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2


Time Name Busy Op Vj Vk Qj QkAdd1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No


4 FU Mult1 Load2 M(A1) Add1

� "��*�� '��'��(��"��&)




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2


Time Name Busy Op Vj Vk Qj Qk2 Add1 Yes SUBD M(A1) M(A2)

Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1


5 FU Mult1 M(A2) M(A1) Add1 Mult2




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6



Add2 Yes ADDD M(A2) Add1Add3 No



6 FU Mult1 M(A2) Add2 Add1 Mult2

� +��,---��$��%��)




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6



Add2 Yes ADDD M(A2) Add1Add3 No



7 FU Mult1 M(A2) Add2 Add1 Mult2

� ,��&�� '��'��(��)




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6


Time Name Busy Op Vj Vk Qj QkAdd1 No

2 Add2 Yes ADDD (M-M) M(A2)Add3 No



8 FU Mult1 M(A2) Add2 (M-M) Mult2




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6










Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10







� ,��*�� '��'��(��)




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 No



11 FU Mult1 M(A2) (M-M+M(M-M) Mult2

� .��(�,---��$��%��)� ,�/��0��1































Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11









Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD M*F4 M(A1)


16 FU M*F4 M(A2) (M-M+M(M-M) Mult2



Faster than light computation(skip a couple of cycles)



Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11









Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11






� !��*�� '��'��(��)




Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11






� 2��+�3��3�(3��4��$



Instruction status: Read Exec Write Exec WriteInstruction j k Issue Oper Comp Result Issue Compl ResultLD F6 34+ R2 1 2 3 4 1 3 4

LD F2 45+ R3 5 6 7 8 2 4 5

MULTD F0 F2 F4 6 9 19 20 3 15 16

SUBD F8 F6 F2 7 9 11 12 4 7 8

DIVD F10 F0 F6 8 21 61 62 5 56 57

ADDD F6 F8 F2 13 14 16 22 6 10 11

� .�0��%��5�� )��6�7��"��(�(��'��

Compare to Scoreboard Cycle 62


Pipelined Functional Units Multiple Functional Units

(6 load, 3 store, 3 +, 2 x/÷) (1 load/store, 1 + , 2 x, 1 ÷)

window size: ~ 14 instructions ~ 5 instructions

No issue on structural hazard same

WAR: renaming avoids stall completion

WAW: renaming avoids stall issue

Broadcast results from FU Write/read registers

Control: reservation stations central scoreboard

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)


° Complexity• delays of 360/91, MIPS 10000, IBM 620?

° Many associative stores (CDB) at high speed

° Performance limited by Common Data Bus• Multiple CDBs => more FU logic for parallel assoc stores

Tomasulo Drawbacks


Administrivia

° Should be debugging Lab 5 by now!• Remember: a Working processor is necessary for full credit…

° Tomorrow: Sections are back in classroom

° More info on some of the things that we have beentalking about last two lectures:

• Computer Architecture: A Quantitative Approach by JohnHennesy and David Patterson

° Next: Memory systems• Start reading Chapter 7 (of your text) now…

• Lab 6 will be using memory systems.


Administrivia: Be careful about clock edges in lab5!

Exe

c

Reg

. F

ile

Mem

Acc

ess

Dat

aM

em

A

B

S

M

Reg

File

Equ

al

PC

Nex

t PC

IR

Inst

. Mem

Valid

IRex

Dcd

Ctr

l

IRm

em

Ex

Ctr

l

IRw

b

Mem

Ctr

l

WB

Ctr

l

D

° Since Register file has edge-triggered write:• Must have everything set up at end of memory stage• This means that “M” register here is not actual register!

° Same with edge-triggered memory ⇒ “D” register appears“inside” memory


Tomasulo Loop Example

Loop:LD F0 0 R1MULTD F4 F0 F2SD F4 0 R1SUBI R1 R1 #8BNEZ R1 Loop

° Assume Multiply takes 4 clocks

° Assume first load takes 8 clocks (cache miss),second load takes 1 clock (hit)

° To be clear, will show clocks for SUBI, BNEZ

° Reality: integer instructions ahead


Loop Example


ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 Load1 No1 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F300 80 Fu



ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop


Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F301 80 Fu Load1

Loop Example Cycle 1



ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop


Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F302 80 Fu Load1 Mult1




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 Store1 Yes 80 Mult12 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No





° Implicit renaming sets up “DataFlow” graph



What does this mean physically?

addr: 80addr: 80

F0: Load 1 F0: Load 1

F4: Mult1 F4: Mult1

��

��

��

��

��

��

��

��

�� !��

"��##��"��"��"��"��$"��%"��&

R(F2) Load1mul

��##��

Addr: 80Addr: 80 Mult1Mult1








° Dispatching SUBI Instruction









° And, BNEZ instruction




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No





° Notice that F0 never sees Load from location 80




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 No2 SD F4 0 R1 Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop



° Register file completely detached from iteration 1




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No






° First and Second iteration completely overlapped


What does this mean physically?

addr: 80addr: 80addr: 72addr: 72

F0: Load2 F0: Load2

F4: Mult2 F4: Mult2

��

��

��

��

��

��

��

��

�� !��

"��##��"��"��"��"��$"��%"��&

R(F2) Load1mulR(F2) Load2mul

��##��

Addr: 80Addr: 80 Mult1Mult1Addr: 72Addr: 72 Mult2Mult2



ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No





° Load1 completing: who is waiting?° Note: Dispatching SUBI




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 10 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop



° Load2 completing: who is waiting?° Note: Dispatching BNEZ




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No



3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop



° Next load in sequence










° Why not issue third multiply?













ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No






° Mult1 completing. Who is waiting?




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No


Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8

0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop



° Mult2 completing. Who is waiting?




ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 No








ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1








ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1








ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 19 Store3 Yes 64 Mult1








ITER Instruction j k Issue Comp esult Busy Addr Fu1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 No2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1







Why can Tomasulo overlap iterations of loops?

° Register renaming

• Multiple iterations use different physical destinationsfor registers (dynamic loop unrolling).

• Replace static register names from code with dynamicregister “pointers”

• Effectively increases size of register file

• Permit instruction issue to advance past integercontrol flow operations.

° Crucial: integer unit must “get ahead” of floating pointunit so that we can issue multiple iterations

° Other idea: Tomasulo building “DataFlow” graph.


Recall: Unrolled Loop That Minimizes Stalls

1 Loop:LD F0,0(R1)2 LD F6,-8(R1)3 LD F10,-16(R1)4 LD F14,-24(R1)5 ADDD F4,F0,F26 ADDD F8,F6,F27 ADDD F12,F10,F28 ADDD F16,F14,F29 SD 0(R1),F410 SD -8(R1),F811 SD -16(R1),F1212 SUBI R1,R1,#3213 BNEZ R1,LOOP14 SD 8(R1),F16 ; 8-32 = -24

&8��0��9$:��'��;<��1


Why issue in order?

° In-order issue permits us to analyze data flow of program° This way, we know exactly which results should flow to

which subsequent instructions• If we issued out-of-order, we would confuse RAW and

WAR hazards!• The most advanced machines that I know of all issue

in order.° This idea works perfectly well “in principle” with multiple

instructions issued per clock:• Need to multi-port “rename table” and be able to rename a

sequence of instructions together• Need to be able to issue to multiple reservation stations in a

single cycle.• Need to have 2x number of read ports and 1x number of

write ports in register file.° In-order issue can be serious bottleneck when issuing

multiple instructions per clock-cycle


Branches must be resolved quickly for loop overlap!

° In our example, we relied on the fact that brancheswere under control of “fast” integer unit in order toget overlap!

Loop: LD F0 0 R1MULTD F4 F0 F2SD F4 0 R1SUBI R1 R1 #8BNEZ R1 Loop

° What happens if branch depends on result of multd??• We completely lose all of our advantages!

• Need to be able to “predict” branch outcome.

• If we were to predict that branch was taken, this wouldbe right most of the time.

° Problem much worse for superscalar machines!10/25/99 ©UCB Fall 1999 CS152 / Kubiatowicz

Lec16.64

Independent “Fetch” unit

Instruction Fetchwith

Branch Prediction

Out-Of-OrderExecution

Unit

Correctness FeedbackOn Branch Results

Stream of InstructionsTo Execute

° Instruction fetch decoupled from execution

° Often issue logic (+ rename) included with Fetch


° Prediction has become essential to getting goodperformance from scalar instruction streams.

° We will discuss predicting branches. However,architects are now predicting everything: datadependencies, actual data, and results of groups ofinstructions:

• At what point does computation become a probabilistic operation +verification?

• We are pretty close with control hazards already…

° Why does prediction work?• Underlying algorithm has regularities.

• Data that is being operated on has regularities.

• Instruction sequence has redundancies that are artifacts of way thathumans/compilers think about problems.

° Prediction ⇒ Compressible information streams?

Prediction: Branches, Dependencies, Data


Dynamic Branch Prediction

° Prediction could be “Static” (at compile time) or“Dynamic” (at runtime)

• For our example, if we were to statically predict“taken”, we would only be wrong once each passthrough loop

° Is dynamic branch prediction better than staticbranch prediction?

• Seems to be. Still some debate to this effect

• Today, lots of hardware being devoted to dynamicbranch predictors.


° Address of branch index to get prediction AND branchaddress (if taken)• Must check for branch match now, since can’t use wrong branch address

• Grab predicted PC from table since may take several cycles to compute

° Update predicted PC when branch is actually resolved

° Return instruction addresses predicted with stack

��'(�� '��

)*

��#��'��+�

�,

��'��-��-��

Simple dynamic prediction: Branch Target Buffer (BTB)



° Performance = ƒ(accuracy, cost of misprediction)

° Branch History Table: Lower bits of PC addressindex table of 1-bit values

• Says whether or not branch taken last time

• No address check

° Problem: in a loop, 1-bit BHT will cause twomispredictions (avg is 9 iteratios before exit):

• End of loop case, when it exits instead of looping as before

• First time through loop on next time through code, when itpredicts exit instead of looping


° Solution: 2-bit scheme where change predictiononly if get misprediction twice: (Figure 4.13, p. 264)

° Red: stop, not taken

° Green: go, taken

° Adds hysteresis to decision making process


T

TNT

NT

Predict Taken

Predict Not Taken

Predict Taken

Predict Not TakenT

NT

T

NT


BHT Accuracy

° Mispredict because either:• Wrong guess for that branch

• Got branch history of wrong branch when index the table

° 4096 entry table programs vary from 1%misprediction (nasa7, tomcatv) to 18% (eqntott),with spice at 9% and gcc at 12%

° 4096 about as good as infinite table(in Alpha 211164)


Correlating Branches

° Hypothesis: recent branches are correlated; that is,behavior of recently executed branches affects predictionof current branch

° Two possibilities; Current branch depends on:• Last m most recently executed branches anywhere in program

Produces a “GA” (for “global address”) in the Yeh and Pattclassification (e.g. GAg)

• Last m most recent outcomes of same branch.Produces a “PA” (for “per address”) in same classification (e.g. PAg)

° Idea: record m most recently executed branches as takenor not taken, and use that pattern to select the properbranch history table entry

• A single history table shared by all branches (appends a “g” at end),indexed by history value.

• Address is used along with history to select table entry (appends a “p”at end of classification)

• If only portion of address used, often appends an “s” to indicate “set-indexed” tables (I.e. GAs)


Correlating Branches

(2,2) GAs predictor• First 2 means that we keep two

bits of history

• Second means that we have 2bit counters in each slot.

• Then behavior of recentbranches selects between,say, four predictions of nextbranch, updating just thatprediction

• Note that the original two-bitcounter solution would be a(0,2) GAs predictor

• Note also that aliasing ispossible here...

Branch address

2-bits per branch predictors

PredictionPrediction

2-bit global branch history register

° For instance, consider global history, set-indexedBHT. That gives us a GAs history table.

+�'(��./��'��


Accuracy of Different Schemes

Fre

quen

cy o

f M

ispr

edic

tion

s

0%

2%

4%

6%

8%

10%

12%

14%

16%

18%

nasa

7

mat

rix3

00

tom

catv

dodu

cd

spic

e

fppp

p gcc

espr

esso

eqnt

ott li

0%

1%

5%

6% 6%

11%

4%

6%

5%

1%

4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)

4096 Entries 2-bit BHTUnlimited Entries 2-bit BHT1024 Entries (2,2) BHT

0%

18%

Fre

qu

ency

of

Mis

pre

dic

tio

ns


HW support for More ILP

° Avoid branch prediction by turning branches intoconditionally executed instructions:

if (x) then A = B op C else NOP• If false, then neither store result nor cause exception

• Expanded ISA of Alpha, MIPS, PowerPC, SPARC haveconditional move; PA-RISC can annul any following instr.

• EPIC: 64 1-bit condition fields selected so conditional execution

° Drawbacks to conditional instructions• Still takes a clock even if “annulled”

• Stall if condition evaluated late

• Complex conditions reduce effectiveness;condition becomes known late in pipeline


Now what about exceptions???

° Out-of-order commit really messes up our chance toget precise exceptions!

• When committing results out-of-order, register filecontains results from later instructions while earlierones have not completed yet.

• What if need to cause exception on one of those earlyinstructions??

° Need to be able to “rollback” register file toconsistent state

• Remember that “precise” means that there is some PCsuch that: all instructions before have committedresults, and none after have committed results.

° Big problem for branch prediction as well:What if prediction wrong??


° Speculation is a form of guessing.

° Important for branch prediction:• Need to “take our best shot” at predicting branch direction.

• If we issue multiple instructions per cycle, lose lots of potentialinstructions otherwise:

- Consider 4 instructions per cycle

- If take single cycle to decide on branch, waste from 4 - 7instruction slots!

° If we speculate and are wrong, need to back up andrestart execution to point at which we predictedincorrectly:

• This is exactly same as precise exceptions!

° Technique for both precise interrupts/exceptions andspeculation: in-order completion or commit

Relationship between precise interrupts and specultation:


HW support for precise interrupts

° Need HW buffer for results ofuncommitted instructions:reorder buffer

• 3 fields: instr, destination, value

• Reorder buffer can be operandsource => more registers like RS

• Use reorder buffer number instead ofreservation station when executioncompletes

• Supplies operands betweenexecution complete & commit

• Once operand commits,result is put into register

• Instructionscommit

• As a result, its easy to undospeculated instructionson mispredicted branchesor on exceptions

ReorderBuffer

FPOp

Queue

FP Adder FP Adder

Res Stations Res Stations

FP Regs


1. Issue—get instruction from FP Op Queue• If reservation station and reorder buffer slot free, issue instr & send

operands & reorder buffer no. for destination (this stage sometimescalled “dispatch”)

2. Execution—operate on operands (EX)• When both operands ready then execute; if not ready, watch CDB for

result; when both in reservation station, execute; checks RAW(sometimes called “issue”)

3. Write result—finish execution (WB)• Write on Common Data Bus to all awaiting FUs & reorder buffer;

mark reservation station available.

4. Commit—update register with reorder result• When instr. at head of reorder buffer & result present, update

register with result (or store to memory) and remove instr fromreorder buffer.

• Mispredicted branch or interrupt flushes reorder buffer (sometimescalled “graduation”)

Four Steps of Speculative Tomasulo Algorithm


3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1

Tomasulo With Reorder buffer:

�� 0

��

��

�� !��

� �1� �&

� �%

� �%

� ��

� ��

� �

----

F0F0<val2><val2>

<val2><val2>ST 0(R3),F0ST 0(R3),F0

ADDD F0,F4,F6ADDD F0,F4,F6YY

ExEx

F4F4 M[10]M[10] LD F4,0(R3)LD F4,0(R3) YY

---- BNE F2,<…>BNE F2,<…> NN

F2F2

F10F10

F0F0

DIVD F2,F10,F6DIVD F2,F10,F6

ADDD F10,F4,F0ADDD F10,F4,F0

LD F0,10(R2)LD F0,10(R2)

NN

NN

NN

��*

��

��

2�3��

#�� 0

1 10+R21 10+R2��

��##��

��


Dynamic Scheduling in PowerPC 604 and Pentium Pro

° Both In-order Issue, Out-of-order execution, In-order Commit

PPro central reservation station for anyfunctional units with one bus shared by abranch and an integer unit


Dynamic Scheduling in PowerPC 604 and Pentium Pro

Parameter PPC PPro

Max. instructions issued/clock 4 3

Max. instr. complete exec./clock 6 5

Max. instr. commited/clock 6 3

Instructions in reorder buffer 16 40

Number of rename buffers 12 Int/8 FP 40

Number of reservations stations 12 20

No. integer functional units (FUs) 2 2No. floating point FUs 1 1No. branch FUs 1 1No. complex integer FUs 1 0No. memory FUs 1 1 load +1 store


Dynamic Scheduling in Pentium Pro

° PPro doesn’t pipeline 80x86 instructions

° PPro decode unit translates the Intel instructions into72-bit micro-operations (- MIPS)

° Sends micro-operations to reorder buffer & reservationstations

° Takes 1 clock cycle to determine length of 80x86instructions + 2 more to create the micro-operations

° Most instructions translate to 1 to 4 micro-operations

° Complex 80x86 instructions are executed by aconventional microprogram (8K x 72 bits) that issueslong sequences of micro-operations


Limits to Multi-Issue Machines

° Inherent limitations of ILP• 1 branch in 5: How to keep a 5-way superscalar busy?

• Latencies of units: many operations must be scheduled

• Need about Pipeline Depth x No. Functional Units of independentinstructions to keep fully busy

• Increase ports to Register File

- VLIW example needs 7 read and 3 write for Int. Reg.& 5 read and 3 write for FP reg

• Increase ports to memory

• Current state of the art: Many hardware structures (such asissue/rename logic) has delay proportional to square of number ofinstructions issued/cycle


° Conflicting studies of amount• Benchmarks (vectorized Fortran FP vs. integer C programs)

• Hardware sophistication

• Compiler sophistication

° How much ILP is available using existingmechanims with increasing HW budgets?

° Do we need to invent new HW/SW mechanisms tokeep on processor performance curve?

• Intel MMX

• Motorola AltaVec

• Supersparc Multimedia ops, etc.

Limits to ILP


Initial HW Model here; MIPS compilers.

Assumptions for ideal/perfect machine to start:

1. Register renaming–infinite virtual registers and allWAW & WAR hazards are avoided

2. Branch prediction–perfect; no mispredictions

3. Jump prediction–all jumps perfectly predicted =>machine with perfect speculation & an unboundedbuffer of instructions available

4. Memory-address alias analysis–addresses areknown & a store can be moved before a load providedaddresses not equal

1 cycle latency for all instructions; unlimited number ofinstructions issued per clock cycle

Limits to ILP


Programs

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doducd tomcatv

54.862.6

17.9

75.2

118.7

150.1

Integer: 18 - 60

FP: 75 - 150

IPC

Upper Limit to ILP: Ideal Machine


Program

0

10

20

30

40

50

60


35

41

16

61

5860

9

1210

48

15

67 6

46

13

45

6 6 7

45

14

45

2 2 2

29

4

19

46

Perfect Selective predictor Standard 2-bit Static None

Change from Infinitewindow to examine to2000 and maximumissue of 64 instructionsper clock cycle

ProfileBHT (512)Pick Cor. or BHTPerfect No prediction

FP: 15 - 45

Integer: 6 - 12

IPC

More Realistic HW: Branch Impact


Program

0

10

20

30

40

50

60


11

15

12

29

54

10

15

12

49

16

10

1312

35

15

44

910

11

20

11

28

5 5 6 5 57

4 45

45 5

59

45

Infinite 256 128 64 32 None

Change 2000 instrwindow, 64 instrissue, 8K 2 levelPrediction

Integer: 5 - 15

FP: 11 - 45

IPC

More Realistic HW: Register Impact (rename regs)

64 None256Infinite 32128


Program

0

5

10

15

20

25

30

35

40

45

50


10

15

12

49

16

45

7 79

49

16

45 4 4

6 53

53 3 4 4

45

Perfect Global/stack Perfect Inspection None

Change 2000 instrwindow, 64 instrissue, 8K 2 levelPrediction, 256renaming registers

FP: 4 - 45(Fortran,no heap)

Integer: 4 - 9

IPC

More Realistic HW: Alias Impact

NoneGlobal/Stack perf;heap conflicts

Perfect Inspec.Assem.


Program

0

10

20

30

40

50

60

gcc expresso li fpppp doducd tomcatv

10

15

12

52

17

56

10

15

12

47

16

10

1311

35

15

34

910 11

22

12

8 8 9

14

9

14

6 6 68

79

4 4 4 54

6

3 2 3 3 3 3

45

22

Infinite 256 128 64 32 16 8 4

Perfect disambiguation(HW), 1K SelectivePrediction, 16 entryreturn, 64 registers,issue as many aswindow

Integer: 6 - 12

FP: 8 - 45

IPC

Realistic HW for ‘9X: Window Impact

64 16256Infinite 32128 8 4


° 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe)vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

Benchmark

0

100

200

300

400

500

600

700

800

900

espr

esso

li

eqnt

ott

com

pres

s sc gcc

spic

e

dodu

c

mdl

jdp2

wav

e5

tom

catv or

a

alvi

nn ear

mdl

jsp2

swm

256

su2c

or

hydr

o2d

nasa

fppp

pBraniac vs. Speed Demon(1993)


Summary #1/2

° Reservations stations: renaming to larger set ofregisters + buffering source operands

• Prevents registers as bottleneck

• Avoids WAR, WAW hazards of Scoreboard

• Allows loop unrolling in HW

° Not limited to basic blocks(integer units gets ahead, beyond branches)

° Helps cache misses as well

° Lasting Contributions• Dynamic scheduling

• Register renaming

• Load/store disambiguation

° 360/91 descendants are Pentium II; PowerPC 604;MIPS R10000; HP-PA 8000; Alpha 21264


° Dynamic hardware schemes can unroll loopsdynamically in hardware

° Branch prediction very important to good performance

° Precise exceptions/Speculation: Out-of-orderexecution, In-order commit (reorder buffer)

° Superscalar and VLIW: CPI < 1 (IPC > 1)• Dynamic issue vs. Static issue

• More instructions issue at same time => larger hazard penalty

• Limitation is often number of instructions that you can successfullyfetch and decode per cycle ⇒ “Flynn barrier”

° SW Pipelining• Symbolic Loop Unrolling to get most from pipeline with little code

expansion, little overhead

Summary #2/2