CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of...
Transcript of CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of...
![Page 1: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/1.jpg)
CS422Computer Architecture
Spring 2004
Lecture 17, 26 Feb 2004
Bhaskaran RamanDepartment of CSE
IIT Kanpur
http://web.cse.iitk.ac.in/~cs422/index.html
![Page 2: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/2.jpg)
Speculation● Wish to move instructions across branches
– To eliminate possible stalls– For better scheduling– Appropriate conditional instructions may not
always exist● Example:
if (N == 0) {A = *X;
} else {A++;
}
![Page 3: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/3.jpg)
Speculation: An Example
● Compiler predicts that the “then” clause is most likely
● Speculatively schedules the “then” clause– Eliminates 2 stalls, and the JMP instruction
LW R1, 0(R2) // Load NBNEZ R1, L1 // Test NLW R3, 0(R4) // Load XLW R5, 0(R3) // Load *XJMP L2 // Skip else
L1: LW R5, 0(R6) // Load AADDI R5, R5, #1 // Incrmt.
L2: SW 0(R6), R3 // Store A
LW R1, 0(R2) // Load NLW R3, 0(R4) // Load XLW R5, 0(R3) // Load *XBEQZ R1, L3 // Test NLW R5, 0(R6) // Load AADDI R5, R5, #1 // Incrmt.
L2: SW 0(R6), R5 // Store A
![Page 4: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/4.jpg)
Exception Behaviour
● Terminating vs. non-terminating exceptions● While doing such scheduling:
– Correct program ==> no extra terminating exceptions
– Incorrect program ==> should preserve any terminating exceptions
LW R1, 0(R2) // Load NLW R3, 0(R4) // Load XLW R5, 0(R3) // Load *XBEQZ R1, L3 // Test NLW R5, 0(R6) // Load AADDI R5, R5, #1 // Incrmt.
L2: SW 0(R6), R5 // Store A
![Page 5: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/5.jpg)
Preserving Exception Behaviour
● Approach 1: ignore terminating exceptions for speculated instructions– Incorrect programs may not be terminated
LW R1, 0(R2) // Load NLW* R3, 0(R4) // Load X, speculatedLW* R5, 0(R3) // Load *X, speculatedBEQZ R1, L3 // Test NLW R5, 0(R6) // Load AADDI R5, R5, #1 // Incrmt.
L2: SW 0(R6), R5 // Store A
![Page 6: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/6.jpg)
Preserving Exception Behaviour (continued)
● Approach 2: poison bits– Set poison bit in result register of conditional
instruction, if exception occurs– Raise exception if any other instruction uses that
register
LW R1, 0(R2) // Load NLW* R3, 0(R4) // Load X, set poison bit on exceptionLW* R5, 0(R3) // Load *X, set poison bit on exceptionBEQZ R1, L3 // Test NLW R5, 0(R6) // Load AADDI R5, R5, #1 // Incrmt.
L2: SW 0(R6), R5 // Store A
![Page 7: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/7.jpg)
Preserving Exception Behaviour (continued)
● Extra register R10 gets used up● Extra instruction in “else” clause
LW R1, 0(R2) // Load NLW* R3, 0(R4) // Load X, speculativeADDI* R10, R1, #1 // N++, speculativeLW* R5, 0(R3) // Load *X, speculativeBEQZ R1, L3 // Test NLW R5, 0(R6) // Load AADDI R5, R5, #1 // Incrmt.MOV R10, R1 // So that R10 has N
L2: SW 0(R6), R5 // Store A
if (N == 0) {A = *X;N++;
} else {A++;
}
![Page 8: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/8.jpg)
Preserving Exception Behaviour (continued)
● Approach 3: buffer results– Instructions boosted past branches, flagged as
boosted (in opcode)– Results of boosted instructions forwarded and
used, like in Tomasulo– When branch is reached, result of speculation is
checked● Result committed if prediction correct● Result discarded otherwise
– Solution close to fully hardware-based speculation
![Page 9: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/9.jpg)
Boosted Instructions: An Example
● The “+” denotes a boosted instruction, and is boosted across the next branch, which is predicted taken
LW R1, 0(R2) // Load NLW+ R3, 0(R4) // Load X, boostedADDI+ R1, R1, #1 // N++, boostedLW+ R5, 0(R3) // Load *X, boostedBEQZ R1, L3 // Test NLW R5, 0(R6) // Load AADDI R5, R5, #1 // Incrmt.
L2: SW 0(R6), R5 // Store A
if (N == 0) {A = *X;N++;
} else {A++;
}
![Page 10: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/10.jpg)
Hardware-Based Speculation● Combination of branch prediction,
speculation, and dynamic scheduling● Data flow execution: instruction executes as
soon as the data it requires is ready● Advantages over software approach:
– Memory disambiguation is better– Better branch prediction– Precise exception model– No book-keeping code– Works for “old” software too
● Disadvantage: hardware cost and complexity
![Page 11: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/11.jpg)
Speculation in Tomasulo● Speculate using branch prediction● Go ahead and execute based on speculation● Use results of speculated instructions for
other instructions, just as in Tomasulo● But, commit result only after knowing if
speculation was correct– In-order commit– Using reorder buffer– Also achieves precise exceptions
![Page 12: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/12.jpg)
The Reorder Buffer● Similar to the store buffer in functionality● Replaces the store and load buffers● Virtual registers are the reorder buffer entries
– The reservation stations are not virtual registers anymore
● Reorder Buffer Data Structure– Instruction type: branch, store, or ALU/load– Destination: register or memory location– Value: which has to be committed
![Page 13: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/13.jpg)
Tomasulo Using the Reorder Buffer
FP Opn Queue
FP Regs
Resvn. Stns.Resvn. Stns.
FP ADD/SUBFP MUL/DIV
Opn. Bus
Opnd. Bus
Load resultsfrom mem.
To mem.
From instn. unit
Common Data Bus
Reorder Buffer
![Page 14: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/14.jpg)
Pipeline Stages
● Issue, EX, WB, Commit● Issue allocates a reorder buffer entry
– Entries allocated in circular fashion
● Commit writes result back to destination– Frees up the reorder buffer entry– For branch instruction
● Prediction correct ==> commit● Else, flush reorder buffer
![Page 15: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/15.jpg)
Summary of ILP Techniques● Software techniques
– Compiler scheduling, Loop unrolling, Software pipelining, Trace scheduling (VLIW), Static branch prediction, Speculation
● Hardware support for software– Conditional instructions, poison bits
● Hardware techniques– Hardware scheduling, Dynamic branch prediction,
Hardware speculation● Which hardware technique(s) to use?
![Page 16: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/16.jpg)
How Much ILP is Available?
● Assume infinite hardware resources– Infinite virtual registers– Perfect branch prediction, jump prediction– Perfect memory disambiguation
● Every instruction is scheduled as early as possible– Restricted only by data flow
![Page 17: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/17.jpg)
Available ILP in Programs
Tom-catv
Doduc
Fpppp
Li
Espresso
Gcc
0 25 50 75 100 125 150 175
SP
EC
ben
chm
ark
Issues/Cycle
![Page 18: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/18.jpg)
Window Size Limitation
Gcc Espresso Li Fpppp Doduc Tomcatv0
102030405060708090
100110120130140150
Infinite 512 128 32 8 4
Benchmark
Issu
es/C
ycle
![Page 19: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/19.jpg)
Effect of Imperfect Branch Predictions
Gcc Espresso Li Fpppp Doduc Tomcatv0
5
10
15
20
25
30
35
40
45
50
55
60
65
Perfect Selector 2-bit Static None
Benchmark
Issu
es/C
ycle
![Page 20: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/20.jpg)
Effect of Finite Virtual Register Set
Gcc Espresso Li Fpppp Doduc Tomcatv0
5
10
15
20
25
30
35
40
45
50
55
60
Infinite 256 128 64 32 None
Benchmark
Issu
es/C
ycle
![Page 21: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/21.jpg)
A Realizable Processor
● Up to 64 instruction issues per cycle● Selective predictor with 1K entries, and a 16-
entry return predictor● Perfect memory disambiguation● Register renaming with 64 integer virtual
registers, and 64 FP virtual registers
![Page 22: CS422 Computer Architecturebr/iitk-webpage/courses/cs...Hardware-Based Speculation Combination of branch prediction, speculation, and dynamic scheduling Data flow execution: instruction](https://reader030.fdocuments.us/reader030/viewer/2022040601/5e92ba67661f8f371a469f21/html5/thumbnails/22.jpg)
ILP for a Realizable Processor
Gcc Espresso Li Fpppp Doduc Tomcatv0
5
10
15
20
25
30
35
40
45
50
55
60
Infinite 256 128 64 32 16 8 4
Benchmark
Issu
es/C
ycle