Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures
Advanced Computer Architecture 5MD00 / 5Z033
ILP architectures
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/aca
2007
04/22/23 ACA H.Corporaal 2
Topics
• Introduction
• Hazards
  – Data dependences
  – Control dependences
• Branch prediction
• Dependences limit ILP: scheduling
• Out-of-order execution: hardware speculation
• Multiple issue
• How much ILP is there?
Introduction
ILP = Instruction level parallelism
• multiple operations (or instructions) can be executed in parallel
Needed:
• Sufficient resources
• Parallel scheduling
  – Hardware solution
  – Software solution
• Application should contain ILP
Hazards
• Three types of hazards
  – Structural
  – Data dependence
  – Control dependence
• Hazards cause scheduling problems
Data dependences
• RaW: read after write
• WaR: write after read
• WaW: write after write
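The three dependence types can be checked mechanically for a pair of instructions. The sketch below is only an illustration: the `Instr` tuple, the function name `dependences` and the example instructions are assumptions made here, not part of the course material.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    dest: Optional[str]     # register written (None if nothing is written)
    srcs: Tuple[str, ...]   # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Dependence types of `second` on the earlier instruction `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RaW")    # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WaR")    # second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WaW")    # both write the same register
    return found


# Hypothetical instruction pairs, chosen only to trigger each case
add = Instr("r1", ("r2", "r3"))   # add r1,r2,r3
sub = Instr("r4", ("r1", "r5"))   # sub r4,r1,r5
mul = Instr("r2", ("r6", "r7"))   # mul r2,r6,r7
mov = Instr("r1", ("r8",))        # mov r1,r8
```

Only RaW is a true data dependence; WaR and WaW are name dependences on the register.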
Control Dependences

C input code:
    if (a > b) { r = a % b; }
    else       { r = b % a; }
    y = a*b;

CFG:
    1:  sub t1, a, b
        bgz t1, 2, 3
    2:  rem r, a, b
        goto 4
    3:  rem r, b, a
        goto 4
    4:  mul y, a, b
        …………..

How real are control dependences?
Branch Prediction
Branch Prediction: Motivation
• High branch penalties in pipelined processors:
  – With on average one out of five instructions being a branch, the maximum ILP is five
  – Situation is even worse for multiple-issue processors, because we need to provide an instruction stream of n instructions per cycle
• Idea: predict the outcome of branches based on their history and execute instructions speculatively
5 Branch Prediction Schemes
• 1-bit Branch Prediction Buffer
• 2-bit Branch Prediction Buffer
• Correlating Branch Prediction Buffer
• Branch Target Buffer
• Return Address Predictors
+ A way to get rid of those malicious branches
1-bit Branch Prediction Buffer
• 1-bit branch prediction buffer or branch history table (BHT):
[Figure: low-order bits of the PC (10…..10 101 00) index the BHT, a column of 1-bit entries (01010110)]
• Buffer is like a cache without tags
• Does not help for the simple MIPS pipeline, because the target address is calculated in the same stage as the branch condition
Branch Prediction Buffer: 1-bit prediction
[Figure: K bits of the branch address index a table of 2^K entries, each holding one prediction bit]
Problems:
• Aliasing: the lower K bits of different branch instructions could be the same
  – Solution: use tags (the buffer then works like a tagged cache); however, this is very expensive
• Loops are predicted wrong twice (at the loop exit, and again at the first iteration of the next execution of the loop)
  – Solution: use n-bit saturating counter prediction
    * taken if counter ≥ 2^(n-1)
    * not-taken if counter < 2^(n-1)
  – A 2-bit saturating counter predicts a loop wrong only once
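The effect of the counter width on a loop branch can be tried out with a few lines of simulation. This is a minimal sketch for a single counter (no table, no aliasing); the function name `mispredictions`, the initial counter value and the loop shape (9 taken iterations, then exit, executed 3 times) are assumptions chosen for illustration.

```python
def mispredictions(outcomes, n_bits, counter=None):
    """Count mispredictions of one n-bit saturating counter.

    Prediction rule from the slide: taken if counter >= 2**(n_bits - 1).
    By default the counter starts saturated at 'taken' (an assumption).
    """
    top = (1 << n_bits) - 1
    threshold = 1 << (n_bits - 1)
    if counter is None:
        counter = top
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            wrong += 1
        # Saturating update: count up on taken, down on not taken
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return wrong


# Loop branch: taken 9 times, not taken at loop exit; the loop is run 3 times
loop = ([True] * 9 + [False]) * 3

one_bit = mispredictions(loop, n_bits=1)   # 5: exit of each run + re-entry of runs 2 and 3
two_bit = mispredictions(loop, n_bits=2)   # 3: only the exit of each run
```

In steady state the 1-bit scheme misses twice per loop execution and the 2-bit scheme once, which is the behavior stated above.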
2-bit Branch Prediction Buffer
• Solution: 2-bit scheme where the prediction is changed only if mispredicted twice
• Can be implemented as a saturating counter:
[Figure: state diagram with four states, two labelled Predict Taken and two labelled Predict Not Taken; the transitions are labelled T (taken) and NT (not taken)]
Correlating Branches
• Fragment from SPEC92 benchmark eqntott:

    if (aa==2) aa = 0;
    if (bb==2) bb = 0;
    if (aa!=bb) {..}

        subi R3,R1,#2
    b1: bnez R3,L1
        add  R1,R0,R0
    L1: subi R3,R2,#2
    b2: bnez R3,L2
        add  R2,R0,R0
    L2: sub  R3,R1,R2
    b3: beqz R3,L3
Correlating Branch Predictor
Idea: the behavior of the current branch is related to the taken/not-taken history of recently executed branches
  – The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
• (2,2) predictor: 2-bit global, 2-bit local
• (k,n) predictor uses the behavior of the last k branches to choose from 2^k predictors, each of which is an n-bit predictor
[Figure: 4 bits from the branch address select a row of 2-bit per-branch local predictors; a 2-bit global branch history shift register (01 = not taken, then taken) selects the prediction within that row]
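A (k,n) predictor as described above can be sketched in a few lines. The class and helper names, the word-aligned PC indexing (`pc >> 2`) and the alternating test pattern are assumptions made for this illustration; real designs differ in indexing and initial state.

```python
class CorrelatingPredictor:
    """(k, n) predictor: k global history bits pick one of 2**k n-bit counters
    in the row selected by the branch address."""

    def __init__(self, k, n, index_bits=4):
        self.k, self.n, self.index_bits = k, n, index_bits
        self.history = 0  # global shift register, newest outcome in bit 0
        self.counters = [[0] * (1 << k) for _ in range(1 << index_bits)]

    def _row(self, pc):
        # Assumption: 4-byte instructions, so the two lowest PC bits are dropped
        return self.counters[(pc >> 2) & ((1 << self.index_bits) - 1)]

    def predict(self, pc):
        return self._row(pc)[self.history] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        row = self._row(pc)
        top = (1 << self.n) - 1
        c = row[self.history]
        row[self.history] = min(c + 1, top) if taken else max(c - 1, 0)
        # Shift the outcome into the global history, keeping k bits
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)


def count_wrong(predictor, pc, outcomes):
    """Run one branch through the predictor and count mispredictions."""
    wrong = 0
    for taken in outcomes:
        wrong += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return wrong


# A branch that alternates taken / not taken, 100 executions
alternating = [True, False] * 50
plain_2bit = count_wrong(CorrelatingPredictor(0, 2), 0x40, alternating)  # 50
corr_2_2 = count_wrong(CorrelatingPredictor(2, 2), 0x40, alternating)    # 3
```

With k = 0 the structure degenerates to a plain 2-bit BHT, which mispredicts every taken outcome of the alternating pattern; with two history bits the pattern is learned after a short warm-up.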
Branch Correlation Using Branch History
Two schemes (a, k, m, n)
• PA: per-address history, a > 0
• GA: global history, a = 0
[Figure: a bits of the branch address index the Branch History Table (k-bit entries); the k history bits (columns 0 .. 2^k-1) and m bits of the branch address (rows 0 .. 2^m-1) index the Pattern History Table of n-bit saturating up/down counters, which delivers the prediction]
Table size (usually n = 2): #bits = k * 2^a + 2^k * 2^m * n
Variant: gshare (Scott McFarling '93): GA which takes the exclusive OR (XOR) of PC address bits and branch history bits
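The table-size formula can be checked against the two configurations that appear in the accuracy chart of this lecture, PA(10, 6, 4, 2) and GA(0,11,5,2). The sketch also shows a gshare-style index, using exclusive OR as gshare is usually described; the function names and the `pc >> 2` alignment are assumptions for illustration.

```python
def table_bits(a, k, m, n):
    """#bits = k * 2^a (history table) + 2^k * 2^m * n (pattern history table)."""
    return k * 2**a + 2**k * 2**m * n


def gshare_index(pc, history, bits):
    """Index into the counter table: PC bits XOR global history bits."""
    mask = (1 << bits) - 1
    return ((pc >> 2) ^ history) & mask   # assumption: 4-byte instructions


pa_bits = table_bits(10, 6, 4, 2)   # 6*1024 + 64*16*2 = 8192 bits = 1 KiB
ga_bits = table_bits(0, 11, 5, 2)   # 11 + 2048*32*2 = 131083 bits, about 16 KiB
```

The XOR spreads branches that share a history pattern over different counters, which is the point of the gshare variant.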
Accuracy (taking the best combination of parameters):
[Figure: branch prediction accuracy (%), from 89 to 98, versus predictor size (bytes), from 64 to 64K, for Bimodal, GAs and PAs predictors; marked points: PA(10, 6, 4, 2) and GA(0,11,5,2)]
Accuracy of Different Branch Predictors
[Figure: bar chart of the frequency of mispredictions (0% to 18%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott and li, comparing three predictors: 4,096 entries 2-bit BHT, unlimited entries 2-bit BHT, and 1,024 entries (2,2) BHT]
BHT Accuracy
• Mispredict because either:
  – Wrong guess for that branch
  – Got branch history of wrong branch when indexing the table
• 4096-entry table: misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• For SPEC92, 4096 entries are about as good as an infinite table
• Real programs + OS are more like gcc
Branch Target Buffer
• Branch condition is not enough!
• Branch Target Buffer (BTB): tag and target address
[Figure: the PC (10…..10 101 00) indexes the BTB; each entry holds a tag (branch PC) and the PC if taken; the tag is compared (=?) with the PC; the branch prediction is often kept in a separate table]
• Yes (tag matches): instruction is a branch. Use predicted PC as next PC if branch is predicted taken.
• No: instruction is not a branch. Proceed normally.
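The lookup described above fits in a small sketch. Everything here is illustrative: the class name, a direct-mapped organization with 8 entries, 4-byte instructions and the taken/not-taken prediction passed in from outside (standing in for the separate prediction table) are assumptions, not details from the lecture.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds (tag, target) or None."""

    def __init__(self, index_bits=3):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _split(self, pc):
        word = pc >> 2                                  # assumption: 4-byte instructions
        index = word & ((1 << self.index_bits) - 1)
        tag = word >> self.index_bits
        return index, tag

    def next_pc(self, pc, predict_taken):
        """Next fetch address: the stored target on a hit that is predicted taken."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag and predict_taken:
            return entry[1]
        return pc + 4                                   # not a known branch, or predicted not taken

    def record(self, pc, target):
        """Store the target of a branch once it has been resolved."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)


btb = BranchTargetBuffer()
cold = btb.next_pc(0x100, predict_taken=True)     # 0x104: nothing stored yet
btb.record(0x100, 0x200)
hit = btb.next_pc(0x100, predict_taken=True)      # 0x200
alias = btb.next_pc(0x120, predict_taken=True)    # 0x124: same index, different tag
```

The tag comparison is what distinguishes the BTB from the tagless BHT: an instruction that merely shares the index bits is not mistaken for the stored branch.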
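Instruction Fetch Stage
[Figure: the PC addresses the Instruction Memory, which fills the Instruction register, and in parallel the BTB; the next PC is PC + 4, or the BTB target address when the entry is found and predicted taken]
Not shown: hardware needed when prediction was wrong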
Special Case: Return Addresses
• Register indirect branches: hard to predict target address
  – MIPS instruction: jr r31 ; PC = r31
  – useful for
    • implementing switch/case statements
    • FORTRAN computed GOTOs
    • procedure return (mainly)
• SPEC89: 85% of such branches are procedure returns
• Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries give a very high hit rate
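A return address predictor of this kind is essentially a bounded stack. The sketch below assumes a return address of `pc + 4` (4-byte instructions, no branch delay slot) and drops the oldest entry on overflow; both are simplifying assumptions, and the class name is hypothetical.

```python
class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, size=8):
        self.size = size
        self.stack = []

    def on_call(self, pc):
        if len(self.stack) == self.size:
            self.stack.pop(0)            # overflow: the oldest return address is lost
        self.stack.append(pc + 4)        # assumption: return to the next instruction

    def predict_return(self):
        """Predicted target of a return, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(size=2)
ras.on_call(0x100)
ras.on_call(0x200)
ras.on_call(0x300)                       # pushes out the entry for the call at 0x100
first = ras.predict_return()             # 0x304
second = ras.predict_return()            # 0x204
third = ras.predict_return()             # None: would be a misprediction in hardware
```

The overflow case shows why the hit rate depends on call depth: only nesting deeper than the buffer size loses predictions.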
Dynamic Branch Prediction Summary
• Prediction is an important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches
  – Or different executions of the same branch
• Branch Target Buffer: includes branch target address (& prediction)
• Return address stack for prediction of indirect jumps
Or: Avoid branches !
Predicated Instructions
• Avoid branch prediction by turning branches into conditional or predicated instructions:
• If the condition is false, then neither store the result nor cause an exception
  – Expanded ISAs of Alpha, MIPS, PowerPC and SPARC have conditional move; PA-RISC can annul any following instruction
  – IA-64/Itanium: conditional execution of any instruction
• Examples:
    if (R1==0) R2 = R3;        CMOVZ  R2,R3,R1

    if (R1 < R2)               SLT    R9,R1,R2
        R3 = R1;               CMOVNZ R3,R1,R9
    else                       CMOVZ  R3,R2,R9
        R3 = R2;
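The second example can be checked by modelling the three instructions as plain functions. This is a sketch of the semantics as used on the slide (CMOVZ copies when the condition register is zero, CMOVNZ when it is non-zero); the Python function names are hypothetical stand-ins for the instructions.

```python
def slt(a, b):
    """SLT: set to 1 if a < b, else 0."""
    return 1 if a < b else 0


def cmovz(dest, src, cond):
    """CMOVZ dest,src,cond: dest = src if cond == 0, otherwise dest is unchanged."""
    return src if cond == 0 else dest


def cmovnz(dest, src, cond):
    """CMOVNZ dest,src,cond: dest = src if cond != 0, otherwise dest is unchanged."""
    return src if cond != 0 else dest


def branch_free_select(r1, r2, r3=0):
    """if (R1 < R2) R3 = R1; else R3 = R2; written without any branch."""
    r9 = slt(r1, r2)             # SLT    R9,R1,R2
    r3 = cmovnz(r3, r1, r9)      # CMOVNZ R3,R1,R9
    r3 = cmovz(r3, r2, r9)       # CMOVZ  R3,R2,R9
    return r3
```

Exactly one of the two conditional moves takes effect, so the result equals `min(r1, r2)` and no branch has to be predicted.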
Dynamic Scheduling
Dynamic Scheduling Principle
• What we examined so far is static scheduling
  – Compiler reorders instructions so as to avoid hazards and reduce stalls
• Dynamic scheduling: hardware rearranges instruction execution to reduce stalls
• Example:
    DIV.D F0,F2,F4     ; takes 24 cycles and is not pipelined
    ADD.D F10,F0,F8
    SUB.D F12,F8,F14   ; this instruction cannot continue even though it does not depend on anything
• Key idea: allow instructions behind a stall to proceed
• Book describes Tomasulo algorithm, but we describe the general idea
Advantages of Dynamic Scheduling
• Handles cases when dependences are unknown at compile time
  – e.g., because they may involve a memory reference
• It simplifies the compiler
• Allows code compiled for one or no pipeline to run efficiently on a different pipeline
• Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling
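Superscalar Concept
[Figure: block diagram. Instruction Memory feeds the Instruction Cache and the Decoder; decoded instructions wait in Reservation Stations in front of the function units (Branch Unit, ALU-1, ALU-2, Logic & Shift, Load Unit, Store Unit); results pass through the Reorder Buffer to the Register File; the Load and Store Units exchange addresses and data with the Data Cache, which is backed by Data Memory]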
Superscalar Issues
• How to fetch multiple instructions in time (across basic block boundaries)?
• Predicting branches
• Non-blocking memory system
• Tune #resources (FUs, ports, entries, etc.)
• Handling dependencies
• How to support precise interrupts?
• How to recover from a mis-predicted branch path?
• For the latter two issues we need to look at sequential look-ahead and architectural state
  – Ref: Johnson 91
Example of Superscalar Processor Execution
• Superscalar processor organization:
  – simple pipeline: IF, EX, WB
  – fetches 2 instructions each cycle
  – 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier
  – Instruction window (buffer between IF and EX stage) is of size 2
  – FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)
L.D   F2,48(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4
Example of Superscalar Processor Execution (cycle 1)

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)     IF
L.D   F2,48(R3)     IF
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4
Example of Superscalar Processor Execution (cycle 2)

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)     IF  EX
L.D   F2,48(R3)     IF  EX
MUL.D F0,F2,F4          IF
SUB.D F8,F2,F6          IF
DIV.D F10,F0,F6
ADD.D F6,F8,F2
MUL.D F12,F2,F4
Example of Superscalar Processor Execution (cycle 3)

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)     IF  EX  WB
L.D   F2,48(R3)     IF  EX  WB
MUL.D F0,F2,F4          IF  EX
SUB.D F8,F2,F6          IF  EX
DIV.D F10,F0,F6             IF
ADD.D F6,F8,F2              IF
MUL.D F12,F2,F4
Example of Superscalar Processor Execution (cycle 4)

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)     IF  EX  WB
L.D   F2,48(R3)     IF  EX  WB
MUL.D F0,F2,F4          IF  EX  EX
SUB.D F8,F2,F6          IF  EX  EX
DIV.D F10,F0,F6             IF  -
ADD.D F6,F8,F2              IF  -
MUL.D F12,F2,F4

(- = waiting in the instruction window)
• DIV.D and ADD.D: stall because of data dependences
• MUL.D F12,F2,F4: cannot be fetched because the window is full
Example of Superscalar Processor Execution (cycle 5)

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)     IF  EX  WB
L.D   F2,48(R3)     IF  EX  WB
MUL.D F0,F2,F4          IF  EX  EX  EX
SUB.D F8,F2,F6          IF  EX  EX  WB
DIV.D F10,F0,F6             IF  -   -
ADD.D F6,F8,F2              IF  -   EX
MUL.D F12,F2,F4                     IF

(- = waiting in the instruction window)
Example of Superscalar Processor Execution (cycle 6)

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)     IF  EX  WB
L.D   F2,48(R3)     IF  EX  WB
MUL.D F0,F2,F4          IF  EX  EX  EX  EX
SUB.D F8,F2,F6          IF  EX  EX  WB
DIV.D F10,F0,F6             IF  -   -   -
ADD.D F6,F8,F2              IF  -   EX  EX
MUL.D F12,F2,F4                     IF  -

(- = waiting in the instruction window)
• MUL.D F12,F2,F4: cannot execute because of a structural hazard (the single FP multiplier is still busy)
Example of Superscalar Processor Execution (cycle 7)

Cycle               1   2   3   4   5   6   7
L.D   F6,32(R2)     IF  EX  WB
L.D   F2,48(R3)     IF  EX  WB
MUL.D F0,F2,F4          IF  EX  EX  EX  EX  WB
SUB.D F8,F2,F6          IF  EX  EX  WB
DIV.D F10,F0,F6             IF  -   -   -   EX
ADD.D F6,F8,F2              IF  -   EX  EX  WB
MUL.D F12,F2,F4                     IF  -   ?

(- = waiting in the instruction window)
Register Renaming
• A technique to eliminate anti- and output dependencies
• Can be implemented
  – by the compiler
    • advantage: low cost
    • disadvantage: "old" codes perform poorly
  – in hardware
    • advantage: binary compatibility
    • disadvantage: extra hardware needed
• We describe the general idea
Register Renaming
  – there is a physical register file larger than the logical register file
  – a mapping table associates logical registers with physical registers
  – when an instruction is decoded
    • its physical source registers are obtained from the mapping table
    • its physical destination register is obtained from a free list
    • the mapping table is updated

                   before           after
instruction:       add r3,r3,4      add R2,R1,4
mapping table:
  r0               R8               R8
  r1               R7               R7
  r2               R5               R5
  r3               R1               R2
  r4               R9               R9
free list:         R2 R6            R6
Eliminating False Dependencies
• How register renaming eliminates false dependencies:
• Before:
    addi r1, r2, 1
    addi r2, r0, 0
    addi r1, r0, 1
• After (free list: R7, R8, R9):
    addi R7, R5, 1
    addi R8, R0, 0
    addi R9, R0, 1
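The decode-time renaming steps can be replayed in software. The sketch below reproduces both renaming examples of this lecture; the function name `rename`, the tuple encoding of instructions and the initial mapping of r1 in the second example (not given on the slide) are assumptions. Returning physical registers to the free list is not modelled.

```python
def rename(instrs, mapping, free_list):
    """Rename (op, dest, srcs) instructions; returns (renamed, mapping, free list)."""
    mapping = dict(mapping)
    free = list(free_list)
    renamed = []
    for op, dest, srcs in instrs:
        # 1. Sources are looked up first (immediates are passed through unchanged)
        phys_srcs = [mapping.get(s, s) for s in srcs]
        # 2. The destination gets a fresh physical register from the free list
        phys_dest = free.pop(0)
        # 3. The mapping table is updated
        mapping[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed, mapping, free


# Single-instruction example: add r3,r3,4 with free list R2, R6
one, map_one, free_one = rename(
    [("add", "r3", ["r3", 4])],
    {"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
    ["R2", "R6"],
)

# Three-instruction example with free list R7, R8, R9
# (r1 -> R1 is an assumed initial mapping; the slide does not show it)
three, _, _ = rename(
    [("addi", "r1", ["r2", 1]), ("addi", "r2", ["r0", 0]), ("addi", "r1", ["r0", 1])],
    {"r0": "R0", "r1": "R1", "r2": "R5"},
    ["R7", "R8", "R9"],
)
```

After renaming, the three `addi` instructions write R7, R8 and R9, so the WaR dependence on r2 and the WaW dependence on r1 are gone and the instructions can execute in any order.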
Limitations of Multiple-Issue Processors
• Available ILP is limited (we're not programming with parallelism in mind)
• Hardware cost
  – adding more functional units is easy
  – more memory ports and register ports needed
  – dependency check needs O(n^2) comparisons
• Limitations of VLIW processors
  – Loop unrolling increases code size
  – Unfilled slots waste bits
  – Cache miss stalls pipeline
    • Research topic: scheduling loads
  – Binary incompatibility (not EPIC)
Measuring available ILP: How?
• Using existing compiler
• Using trace analysis
  – Track all the real data dependencies (RaWs) of instructions from issue window
    • register dependence
    • memory dependence
  – Check for correct branch prediction
    • if prediction correct continue
    • if wrong, flush schedule and start in next cycle
Trace analysis
Program
For i := 0..2
A[i] := i;
S := X+3;
Compiled code
set r1,0
set r2,3
set r3,&A
Loop: st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
add r1,r5,3
Trace
set r1,0
set r2,3
set r3,&A
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
add r1,r5,3

How much of this code can be executed in parallel?
Trace analysis
Parallel Trace
cycle 1:  set r1,0      set r2,3      set r3,&A
cycle 2:  st r1,0(r3)   add r1,r1,1   add r3,r3,4
cycle 3:  st r1,0(r3)   add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
cycle 4:  st r1,0(r3)   add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
cycle 5:  brne r1,r2,Loop
cycle 6:  add r1,r5,3

Max ILP =

Speedup = Lserial / Lparallel = 16 / 6 = 2.7
Is this the maximum?
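The parallel trace can be reproduced with a small trace-analysis sketch. It makes simplifying assumptions that go beyond the slides: every instruction has a latency of 1 cycle, only register RaW dependences are tracked (the stores go to different addresses, so memory dependences are ignored), WaR and WaW are assumed to be renamed away, and only the final loop-exit branch is treated as mispredicted. The function name `schedule` and the tuple encoding are hypothetical.

```python
from collections import Counter


def schedule(trace):
    """Earliest cycle per instruction; trace items are (dest, srcs, mispredicted)."""
    ready = {}      # register -> cycle in which its latest value is produced
    barrier = 0     # cycle of the last mispredicted branch
    cycles = []
    for dest, srcs, mispredicted in trace:
        # Wait for all source values and for any earlier mispredicted branch
        cycle = max([barrier] + [ready.get(s, 0) for s in srcs]) + 1
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle
        if mispredicted:
            barrier = cycle     # later instructions start in the next cycle
    return cycles


def body(last):
    """One loop iteration: st, add, add, brne (mispredicted only at loop exit)."""
    return [
        (None, ("r1", "r3"), False),   # st   r1,0(r3)
        ("r1", ("r1",), False),        # add  r1,r1,1
        ("r3", ("r3",), False),        # add  r3,r3,4
        (None, ("r1", "r2"), last),    # brne r1,r2,Loop
    ]


trace = (
    [("r1", (), False), ("r2", (), False), ("r3", (), False)]   # set r1 / r2 / r3
    + body(False) + body(False) + body(True)
    + [("r1", ("r5",), False)]                                  # add r1,r5,3
)

cycles = schedule(trace)
length = max(cycles)                          # 6 cycles
widest = max(Counter(cycles).values())        # 4 instructions in the busiest cycle
speedup = len(trace) / length                 # 16 / 6
```

Changing the assumptions, for example treating the exit branch as correctly predicted, shortens the schedule, which is one way to approach the question of whether this is the maximum.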
Ideal Processor
Assumptions for ideal/perfect processor:
1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction: perfect => all program instructions available for execution
3. Memory-address alias analysis: addresses are known. A store can be moved before a load provided the addresses are not equal

Also:
  – unlimited number of instructions issued per cycle (unlimited resources)
  – unlimited instruction window
  – perfect caches
  – 1 cycle latency for all instructions (FP *,/)

Programs were compiled using the MIPS compiler with maximum optimization level
Upper Limit to ILP: Ideal Processor
Instruction issues per cycle (IPC) per program:

Program     IPC
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doducd      118.7
tomcatv     150.1

Integer: 18 - 60    FP: 75 - 150
Window Size and Branch Impact
• Change from infinite window to examine 2000 and issue at most 64 instructions per cycle

Instruction issues per cycle (IPC) by branch predictor.
Predictors: Perfect; Tournament (selective predictor); BHT(512) (standard 2-bit); Profile (static); No prediction

Program     Perfect   Tournament   BHT(512)   Profile   No prediction
gcc         35        9            6          6         2
espresso    41        12           7          6         2
li          16        10           6          7         2
fpppp       61        48           46         45        29
doducd      58        15           13         14        4
tomcatv     60        46           45         45        19

Integer: 6 - 12    FP: 15 - 45
Impact of Limited Renaming Registers
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

Instruction issues per cycle (IPC) by number of renaming registers:

Program     Infinite   256   128   64   32   None
gcc         11         10    10    9    5    4
espresso    15         15    13    10   5    4
li          12         12    12    11   6    5
fpppp       59         49    35    20   5    4
doducd      29         16    15    11   5    5
tomcatv     54         45    44    28   7    5

Integer: 5 - 15    FP: 11 - 45
Memory Address Alias Impact
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers
[Figure: bar chart of instruction issues per cycle (IPC, 0 to 50) for gcc, espresso, li, fpppp, doducd and tomcatv under four alias analysis models: Perfect, Global/stack perfect, Inspection, None]
Integer: 4 - 9    FP: 4 - 45 (Fortran, no heap)
Window Size Impact
• Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many as window

Instruction issues per cycle (IPC) by window size:

Program     Infinite   256   128   64   32   16   8   4
gcc         10         10    10    9    8    6    4   3
espresso    15         15    13    10   8    6    4   2
li          12         12    11    11   9    6    4   3
fpppp       52         47    35    22   14   8    5   3
doducd      17         16    15    12   9    7    4   3
tomcatv     56         45    34    22   14   9    6   3

Integer: 6 - 12    FP: 8 - 45
How to Exceed ILP Limits of this Study?
• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not for memory operands
• Unnecessary dependences
  – e.g., the compiler did not unroll loops, so the iteration variable dependence remains
• Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction
  – Address value prediction and speculation predicts addresses and speculates by reordering loads and stores. Could provide better aliasing analysis
Workstation Microprocessors 3/2001
Source: Microprocessor Report, www.MPRonline.com
• Max issue: 4 instructions (many CPUs)
• Max rename registers: 128 (Pentium 4)
• Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
• Max window size (OOO): 126 instructions (Pentium 4)
• Max pipeline: 22/24 stages (Pentium 4)
SPEC 2000 Performance 3/2001
Source: Microprocessor Report, www.MPRonline.com
[Figure: SPEC 2000 performance chart; annotated ratios: 1.6X, 3.8X, 1.2X, 1.7X, 1.5X]
Conclusions
• 1985-2002: >1000X performance (55% / y)
• Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore's Law to get 1.55X/year
  – Caches, (Super)Pipelining, Superscalar, Branch Prediction, Out-of-order execution, Trace cache
• After 2002: slowdown (about 20% / y)
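The two growth figures are consistent with each other, as a quick compound-growth check shows. The only assumption is that 1985-2002 is counted as 17 years; the variable names are illustrative.

```python
years = 2002 - 1985                  # 17 years

# Compound growth at 1.55X per year over the whole period
growth = 1.55 ** years               # roughly 1.7e3, i.e. more than 1000X

# Yearly factor that would give exactly 1000X over the same period
rate_for_1000x = 1000 ** (1 / years)  # roughly 1.50X per year
```

At the later rate of about 20% per year the same 17 years would give only 1.2 ** 17, roughly a factor of 22.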
Conclusions (cont'd)
• ILP limits: to make performance progress in the future, do we need explicit parallelism from the programmer instead of the implicit parallelism of ILP exploited by compiler/HW?
• Other problems:
  – Processor-memory performance gap
  – VLSI scaling problems (wiring)
  – Energy / leakage problems
• However: other forms of parallelism come to the rescue