Out-of-Order Execution, Exception, Branch Prediction, CMP

Post on 16-Oct-2021


EE 457 Questions and Answers for Special Topics: Out-of-Order Execution, Exception, Branch Prediction, CMP

Gandhi Puvvada, Weirong Jiang & Tony Toghia, USC 2008

Out-of-Order (OoO) Execution: Dynamic Scheduling of Instructions (The Tomasulo Algorithm)

[Left block diagram: Issue Unit, TAG FIFO (tags 2 ... 63 shown), Integer Multiplier, Integer Divider.]

OoO Execution and In-Order Committing with ROB (Re-Order Buffer)

[Right block diagram. Labels: Front-end, Back-end, I-Cache, BPB, I-Fetch Queue, Dispatch, Reg File, Re-order Buffer, Issue Unit, Integer Queue, Load/Store Queue, Mult Queue, Div Queue, Exe Units, Mult, Div, Cache, Add Buff, CDB.]

Simplified for EE457. Block diagram provided by Prof. Dubois.

Q#1 What is the important difference between the two block diagrams? Which one supports precise exceptions?


A#1 The ROB is the important difference between the two block diagrams. The right-side block diagram supports precise exceptions.


Q#2 Choose the right attributes to describe the block diagrams.

1. Left Block Diagram: __________ (Out of Order / In-Order) Issue, __________ (Out of Order / In-Order) Execute, __________ (Out of Order / In-Order) Complete.

2. Right Block Diagram: __________ (Out of Order / In-Order) Issue, __________ (Out of Order / In-Order) Execute, __________ (Out of Order / In-Order) Complete.

A#2

1. Left Block Diagram: In-Order Issue, Out of Order Execute, Out of Order Complete.

2. Right Block Diagram: In-Order Issue, Out of Order Execute, In-Order Complete.

Out-of-Order Execution (with ROB)

Q#3 When we refer to an out-of-order processor with ROB, do we mean:
a. instructions are issued out-of-order?
b. instructions start execution out-of-order?
c. instructions finish execution out-of-order?
d. instructions retire out of order?

• A#3: b and c. Instructions are issued and retired in-order, to maintain the functionality of in-order execution. What happens in between, however (the start and completion of execution in the integer and floating-point units), can be done out-of-order.

TAG FIFO (Token FIFO) in the left diagram

Q#4
Q#4.1 Is it necessary to hold the 64 tokens in the 0 to 63 order initially on reset?
Q#4.2 Is the FIFO used for convenience, or is it necessary that we follow the “First-In-First-Out” order?
Q#4.3 Can the FIFO overflow?
Q#4.4 Can the FIFO become empty?

TAG FIFO (Token FIFO)

A#4
A#4.1 It is not necessary to hold the 64 tokens in the 0 to 63 order initially on reset.

A#4.2 The FIFO is used for convenience. It is not necessary that we follow the “First-In-First-Out” order.

A#4.3 The FIFO cannot overflow, as we cannot receive more tokens than we issued.

A#4.4 The FIFO can become empty if the backend capacity exceeds the total number of tokens.


TAGs for destinations or sources or for both? (in ROB-less design)

• A new tag is assigned to the destination register of the instruction being dispatched.

• For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already in RAS) is conveyed to the instruction.

• If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB and announce the value.

Unique TAG

• Like SSN, we need a unique TAG

• SSNs are reused.

• Similarly TAGs can be reused.

• TAGs are similar to numbered TOKENs.

TAGs (= Tokens) (in ROB-less design)

• How many tokens should the bank cashier have to start with?

• What happens if the tokens run out?

• Does he need to have any order in holding tokens and issuing tokens?

• Does he have to collect tokens back?

TAG FIFO (FIFOs are taught in EE560) (in ROB-less design)

• To issue and collect Tokens (TAGs), use a circular FIFO (First-in-First-Out) unit.

• Filled with (say) 64 tokens (in any order) initially on reset.

• Tokens return out of order anyway.

• Put tokens back in the stack and issue.

[Figure: circular TAG FIFO with write pointer (wp) and read pointer (rp), shown in three states: full, after 2 tokens issued, and after 1 token returned.]

(in ROB-less design)
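The token handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EE457 hardware: the class name `TagFifo`, the method names, and the explicit `count` field are hypothetical, and a 4-entry FIFO stands in for the 64-entry one so the trace stays short.

```python
class TagFifo:
    """Circular FIFO of free tags (tokens); an illustrative sketch only."""

    def __init__(self, num_tags=64):
        self.slots = list(range(num_tags))  # any initial order would do
        self.size = num_tags
        self.rp = 0             # read pointer: next tag to issue
        self.wp = 0             # write pointer: slot for the next returned tag
        self.count = num_tags   # number of free tags currently held

    def is_empty(self):
        return self.count == 0

    def issue(self):
        """Hand out a free tag, or None if every tag is in flight."""
        if self.is_empty():
            return None  # dispatch would have to stall here
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return tag

    def give_back(self, tag):
        """Accept a returned tag; only issued tags come back, so no overflow."""
        assert self.count < self.size, "more tags returned than issued"
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.size
        self.count += 1


fifo = TagFifo(4)
a, b = fifo.issue(), fifo.issue()   # 2 tokens issued
fifo.give_back(b)                   # 1 token returned, out of order
remaining = [fifo.issue() for _ in range(3)]
```

The returned tag simply joins the back of the queue, which is why the initial order and the return order do not matter; the FIFO runs dry only when all tags are in flight.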

• Q#5 What is meant by retirement in an out-of-order processor?

• Q#6 What two conditions are required for retirement?

• A#5: Retirement is the point at which an instruction’s results can be committed (written into the register file or memory) or, if it is a conditional branch or an exception, it can be taken. In short, its execution is ensured and it is no longer speculative. Note: In speculative execution, conditional branches are executed based on prediction, and if the prediction turns out to be wrong, the wrong-path instructions are flushed.

• A#6: Execution must be completed, and the instruction must be the oldest instruction not yet retired. (It is the oldest instruction in the re-order buffer.)
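The two retirement conditions in A#6 can be illustrated with a small Python sketch. The ROB is modeled as a queue with the oldest instruction at the head; the function name `retire_ready`, the `done` flag, and the instruction names I1 to I3 are hypothetical and chosen only for this example.

```python
from collections import deque


def retire_ready(rob):
    """Retire from the head of the ROB while the oldest entry has completed."""
    retired = []
    while rob and rob[0]["done"]:
        retired.append(rob.popleft()["name"])
    return retired


rob = deque([
    {"name": "I1", "done": False},  # oldest, still executing
    {"name": "I2", "done": True},   # finished early, but not the oldest
    {"name": "I3", "done": True},
])
first = retire_ready(rob)    # nothing retires: the oldest is not done
rob[0]["done"] = True        # I1 completes execution
second = retire_ready(rob)   # I1, I2, I3 now retire in program order
```

I2 and I3 finish execution out of order, yet they leave the ROB only after I1, which is what keeps committing in-order.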

• Q#7 __________________ (Architectural / Physical) registers are visible to software (i.e. can be used in instructions).

• Q#8 __________________ (Architectural / Physical) registers allow multiple copies of a register to support out-of-order execution (including speculative execution) via register renaming.

• A#7 Architectural registers are visible to software (i.e. can be used in instructions).

• A#8 Physical registers allow multiple copies of a register to support out-of-order execution (including speculative execution) via register renaming.

Limited Architectural Registers, More Physical Registers

Register Renaming

lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);

lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);

It is clear that the compiler is using $8 as a temporary register.

If there is a delay in obtaining $2, the first part of the code cannot proceed.

Unfortunately, the second part of the code cannot proceed either, because of the name dependency on $8.
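To see how renaming removes the name dependency on $8, here is a minimal Python sketch applied to the six instructions above. It assumes every destination write receives a fresh tag and every source picks up the latest tag of its register; the tag names t0, t1, ... and the `rename` helper are hypothetical, and the tuple encoding (op, destination, sources) is only a convenience for this example.

```python
from itertools import count


def rename(program):
    """Give each destination a fresh tag; sources use the latest tag, if any."""
    fresh = (f"t{n}" for n in count())   # hypothetical tag names t0, t1, ...
    latest = {}                          # architectural register -> latest tag
    renamed = []
    for op, dest, srcs in program:
        # Read the sources first, so "add $8, $8, $8" sees the older $8.
        new_srcs = [latest.get(reg, reg) for reg in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("lw",  "$8", ["$2"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$2"]),
    ("lw",  "$8", ["$3"]),
    ("add", "$8", ["$8", "$8"]),
    ("sw",  None, ["$8", "$3"]),
]
renamed = rename(program)
```

After renaming, the second group refers only to t2, t3 and $3, so a delay in obtaining $2 no longer holds it back.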

Q#9 Register renaming can NOT solve
a. RAW hazards
b. WAR hazards
c. WAW hazards

Note: In a design with ROB, WAW and WAR will never occur as all writes are performed strictly in-order. So answer the above question for the ROB-less design.

• A#9: a. The RAW (Read After Write) hazard is the only hazard which cannot be solved by register renaming.

• For WAW (Write After Write) hazard:
– if the instruction order is that $1 gets written twice, and if the later write (W2) can execute before the first write (W1), then the register renaming mechanism allows the earlier write to be discarded in a ROB-less design.

• For WAR (Write After Read) hazard:
– register renaming allows the older version of the register to be read and held in the Issue Queues, so that the later write can proceed.

• For RAW (Read After Write) hazard:
– a dependent read MUST wait and cannot execute before a write to the same location. (The to-be-written value must be determined before it can be read by a later instruction.) The dependent instruction waits in the Issue Queues for the operand to be broadcast on the CDB.


Q#10 What resource is the major bottleneck of the Tomasulo algorithm?

IFQ / Dispatcher / Issue Queues / Execution Units / CDB

A#10 CDB

The issue unit has to throttle issuing instructions to the execution units based on the CDB’s availability. It does not let multiple execution units finish execution at the same time.

• Q#11a Suppose the following lw instruction is in progress and is currently waiting for the cache to respond.

lw $2, 0($4)

Which of the following instructions in the integer issue queue will begin execution the earliest?

#4 subi $6, $7, $8
#3 addi $5, $3, $4
#2 sub $4, $4, $6
#1 (oldest) add $1, $2, $3

• A#11a #2. #1 cannot begin execution, because it reads $2, which is still being written by the LW instruction (RAW hazard). Instruction #2 can begin execution. (Note: Register renaming solves the WAR hazard on $4.)

• Q#11b Given the same situation (lw $2, 0($4)) as the previous problem, now which of the following instructions in the integer issue queue will begin execution the earliest?

#4 subi $6, $7, $8
#3 addi $5, $3, $4
#2 sub $4, $4, $1 (the last operand was $6)
#1 (oldest) add $1, $2, $3

• A#11b Instruction #4 is the earliest instruction that does not read a value that is modified by an earlier instruction.

Without or with ROB?

• Q#11c Are your answers to Q#11a and Q#11b for the first design without ROB or the second design with ROB?

• A#11c For both! RAW dependency is the true dependency and every implementation has to honor that dependency.
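The reasoning behind A#11a and A#11b can be checked with a short Python sketch. It assumes register renaming has already removed WAR and WAW hazards, so only RAW dependences hold an instruction back; the function name `earliest_ready` and the tuple encoding (label, destination, sources) are hypothetical. The third operand of subi is treated as a register, exactly as it is written in the question.

```python
def earliest_ready(queue, pending):
    """Return the label of the oldest instruction with no pending source.

    queue: list of (label, dest, sources), oldest first.
    pending: registers whose values are still being produced.
    """
    pending = set(pending)
    for label, dest, sources in queue:
        if not (set(sources) & pending):
            return label
        # This instruction waits, so younger readers of its result wait too.
        pending.add(dest)
    return None


q11a = [
    ("#1", "$1", ["$2", "$3"]),
    ("#2", "$4", ["$4", "$6"]),
    ("#3", "$5", ["$3", "$4"]),
    ("#4", "$6", ["$7", "$8"]),
]
q11b = [
    ("#1", "$1", ["$2", "$3"]),
    ("#2", "$4", ["$4", "$1"]),   # last operand changed from $6 to $1
    ("#3", "$5", ["$3", "$4"]),
    ("#4", "$6", ["$7", "$8"]),
]
answer_a = earliest_ready(q11a, {"$2"})   # lw $2 is still waiting on the cache
answer_b = earliest_ready(q11b, {"$2"})
```

Nothing in the sketch depends on whether a ROB is present, which matches A#11c.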

Q#12 ROB is the important difference between the two block diagrams. Compare and contrast.

A#12 Compare and contrast: Without ROB vs. With ROB

1. Without ROB: TAG FIFO provides unique TAGs.
   With ROB: ROB location IDs are TAGs.

2. Without ROB: Register Status Table specifies if a register is obsolete.
   With ROB: ROB needs to be searched associatively to find the latest register content.

3. Without ROB: Allows out-of-order completion.
   With ROB: Enforces in-order-only completion.

4. Without ROB: Cannot support exceptions.
   With ROB: Can support exceptions.

5. Without ROB: Cannot support speculative execution.
   With ROB: Can support speculative execution.

6. Without ROB: No speculation, no BPB.
   With ROB: Has BPB to aid in branch prediction.

7. Without ROB: No good for real implementation.
   With ROB: Good for real implementation.

8. Without ROB: Writes are out of order. Hence dispatch is suspended after dispatching a conditional branch, until the branch is resolved.
   With ROB: Writes are in-order. Dispatch continues based on prediction. Design provides for flushing wrong-path execution.

9. Without ROB: Stores write to cache when they come out of the LSQ (load/store queue).
   With ROB: Stores write to cache when they reach the top of the ROB.

10. Without ROB: Memory disambiguation rules are stricter.
    With ROB: Since WAW and WAR are not present, rules are simpler.

11. Without ROB: Only RAR is irrelevant. So two loads from the same address can execute in any order. The rest of the loads and stores with matching addresses have to go in-order.
    With ROB: Only RAW needs to be looked at. Loads read cache before going into the ROB. Hence, loads have to wait until senior stores with matching addresses finish.

12. Without ROB: Suppose a senior load is yet to calculate its memory address. A junior load (but not store) can leave the LSQ. (No RAR, but WAR.) Suppose a senior store is yet to calculate its memory address. A junior load/store cannot leave. (RAW, WAW)
    With ROB: Stores leave a copy of their address in the Address Buffer near the LSQ, so that junior loads can figure out (without looking up the ROB) if they can read cache. It means junior stores, with a senior load yet to calculate its address, cannot leave the LSQ. It means junior stores with an address matching a senior load should not leave the LSQ. Or they can leave if senior loads with a matching address make a note of this.


Exceptions

• Q#1 What is the definition of an exception?

• Q#2 What is the difference between asynchronous and synchronous exceptions? Give two examples of each.

• Q#3 Precise exceptions are _______________ (synchronous / asynchronous) and the excepting instruction _________ (must be / does not need to be) re-executed.

• A#1: Exceptions are very rare events forcing a transfer of program control to a software handler.

• A#2: Synchronous exceptions are triggered by specific instructions (e.g. divide by zero, illegal instruction, page fault, etc.). Asynchronous exceptions include the hardware interrupts and are not tied to a specific executing instruction (e.g. keyboard interrupt, real-time clock, power failure).

• A#3: Precise exceptions are synchronous and the excepting instruction must be re-executed (e.g. in the case of a page fault, ....).

Q#4
• Interrupts are ___________ (Asynchronous/Synchronous) to program execution.

• Traps are ___________ (Asynchronous/Synchronous) to program execution.

A#4
• Interrupts are Asynchronous to program execution. Example: keyboard interrupt.

• Traps are Synchronous to program execution. Example: addition overflow trap.

Q#5
• Match the exceptions with the 5 pipeline stages (IF, ID, EX, MEM, WB): Page Fault, Integer Overflow, Undefined Opcode, Memory Protection Violation.

A#5
• Match the exceptions with the 5 pipeline stages

Exception                     IF   ID   EX   MEM  WB
Page Fault                    X              X
Integer Overflow                        X
Undefined Opcode                   X
Memory Protection Violation   X              X

Q#6 For precise exceptions, the exceptions should be taken in
a. process order
b. temporal order

• A#6: Process order. Exceptions on earlier instructions must be handled before exceptions due to later instructions, regardless of when they are detected.
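As a rough illustration of process order, the Python sketch below picks which exception to handle first when two are detected in the same cycle. The scenario (a load with a page fault detected in MEM, and a younger illegal instruction detected in ID) follows the first run in A#10; the function name `exception_to_take` and the tuple encoding are hypothetical.

```python
def exception_to_take(in_flight):
    """Pick the exception to handle first under process (program) order.

    in_flight: list of (program_order, instruction, exception_or_None).
    """
    faulting = [entry for entry in in_flight if entry[2] is not None]
    if not faulting:
        return None
    # The earliest instruction in program order wins, whenever it was detected.
    return min(faulting, key=lambda entry: entry[0])


in_flight = [
    (3, "illegal", "undefined opcode"),  # detected in ID
    (1, "lw", "page fault"),             # detected in MEM in the same cycle
    (2, "add", None),
    (4, "sw", None),
]
chosen = exception_to_take(in_flight)
```

Here the page fault on the older lw is handled first, even though both exceptions are detected in the same cycle.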

Q#7
• For precise exceptions in the 5-stage pipeline, an exception should be taken in which stage? Why?

• A#7: WB Stage. This is to ensure that no earlier instruction in program order triggers an exception.

Well, as discussed in our class, an exception can be taken in the MEM stage (instead of the WB stage), as the instruction in the WB stage would not cause a new exception.

Q#8
• What are the functions of the Cause Register and Exception PC (EPC)?

• A#8: The Cause register records what type of exception occurred, and the EPC tells the exception handler on which instruction the exception occurred.

Q#9 What are the requirements of precise exception handling in a pipelined processor?

A#9:
• All preceding instructions in process order must complete.
• All instructions following the faulting instruction, plus the faulting instruction itself, must be squashed.
• The execution of the handler must be started.

• Q#10

[Figure: pipeline snapshots for the first run (before first exception handled) and the second run (after page fault handled).]

A#10: First run (before first exception handled)

Cycle 1: IF = SW; ID = Illegal (Exception Detected); EX = ADD; MEM = LW (Exception Detected)
Cycle 2: IF = Start of Exception Handler; ID = NOP; EX = NOP; MEM = NOP; WB = NOP (Exception)

A#10: Second run (after page fault handled)

Cycle 1: IF = SW; ID = Illegal (Exception Detected); EX = ADD; MEM = LW
Cycle 2: IF = NOP; ID = NOP; EX = NOP (Exception); MEM = ADD; WB = LW
Cycle 3: IF = NOP; ID = NOP; EX = NOP; MEM = NOP (Exception); WB = ADD
Cycle 4: IF = Start of Exception Handler; ID = NOP; EX = NOP; MEM = NOP; WB = NOP (Exception)

Branch Prediction

Q#1 Which types of branches need prediction (direction prediction)?
a. Indirect branch due to return from function call
b. Conditional branch
c. Unconditional branch

A#1: Conditional branch

Q#2 Given a simple 1-bit (2-state) pattern history predictor, and assuming the branch is initially predicted not taken, what is the misprediction rate for the following loop? (Assume there are no other branches in the loop.)

for (i=0; i<4; i++)

The misprediction rate (increases/decreases/stays the same) if the loop is re-executed.

[Figure: the branch PC indexes the Branch Prediction Buffer; each entry is a 2-state predictor (N, T).]

A#2 The predictor will predict the 1st branch not taken, and it will predict the 2nd, 3rd, 4th, and 5th branches taken. The 1st and last predictions will be incorrect. So, the misprediction rate is 40%.

The misprediction rate stays the same for all subsequent runs of the loop.

I     0  1  2  3  4
Pred  N  T  T  T  T
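The 40% figure can be reproduced with a few lines of Python. This is a sketch under the assumptions stated in A#2: the loop branch is evaluated five times (i = 0 to 4), it is taken four times and then falls through, and the 1-bit predictor simply repeats the last outcome. The function name `run_one_bit` is hypothetical.

```python
def run_one_bit(outcomes, state="N"):
    """Simulate a 1-bit predictor: predict the last outcome, then update."""
    predictions = []
    for actual in outcomes:
        predictions.append(state)
        state = actual   # remember only the most recent outcome
    return predictions, state


loop = list("TTTTN")   # branch outcomes for i = 0..4
preds1, state = run_one_bit(loop, "N")      # first run, initially not taken
preds2, state = run_one_bit(loop, state)    # loop re-executed
rate1 = sum(p != a for p, a in zip(preds1, loop)) / len(loop)
rate2 = sum(p != a for p, a in zip(preds2, loop)) / len(loop)
```

Because the final not-taken outcome leaves the predictor in state N, every later run starts exactly like the first, so the rate stays the same.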

Examples

How often is branch outcome != previous outcome?

DC08: TTTTTTTTTTT ... TTTTTTTTTTNTTTTTTTTT … (100,000 iterations): 2 / 100,000 (TN, NT); prediction rate 99.998%
DC44: TTTTT ... TNTTTTT … TNTTTTT …: 2 / 100; prediction rate 98.0%
DC50: TNTNTNTNTNTNTNTNTNTNTNTNTNTNT …: 2 / 2; prediction rate 0.0%

© Murali Annavaram, Gabe Loh & Gary Tyson, All rights reserved

Brandon Franzke, USC 2006

Use two bit history
• 2-bit history
– Start as strongly not taken
– Update BPB after every branch execution

[Figure: the branch PC indexes the Branch Prediction Buffer; each entry is a 4-state predictor (SN, N, T, ST).]

TWO-BIT PREDICTOR

2-BIT UP-DOWN SATURATING COUNTER IN EACH ENTRY OF THE BPB

TAKEN ==> ADD 1; UNTAKEN ==> SUBTRACT 1
NOW IT TAKES 2 MISPREDICTIONS IN A ROW TO CHANGE THE PREDICTION
FOR THE NESTED LOOP, THE MISPREDICTION AT ENTRY IS AVOIDED

COULD HAVE MORE THAN 2 BITS, BUT TWO BITS COVER MOST PATTERNS (LOOPS)

[State diagram: 00 Predict U, 01 Predict U, 10 Predict T, 11 Predict T, with T (Taken) and U (Untaken) transitions between neighboring states. State names: SN (Strongly Not Taken), N (Not Taken), T (Taken), ST (Strongly Taken).]

EE557 Michel Dubois USC 2007

• Q#3 Show the states and predictions for 2 runs of the loop shown in Q#2 using the 2-bit pattern history predictor.

States: SN N T ST (initial state: SN)

First run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State       SN
Prediction  N

Second run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State
Prediction

• A#3 The 2-bit predictor works better than the 1-bit predictor after the initial training period. We can improve the initial training period by starting in the T state.

States: SN N T ST

First run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State       SN  N   T   ST  ST
Prediction  N   N   T   T   T

Second run:
Iteration   0   1   2   3   4
Actual      T   T   T   T   N
State       T   ST  ST  ST  ST
Prediction  T   T   T   T   T
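The two runs in A#3 can be reproduced with a small Python simulation of a 2-bit saturating counter. It assumes the counter values 0 to 3 correspond to SN, N, T, ST, that a taken branch adds 1 and an untaken branch subtracts 1, and that values 2 and 3 predict taken; the function name `run_two_bit` is hypothetical.

```python
NAMES = ["SN", "N", "T", "ST"]   # counter values 0..3


def run_two_bit(outcomes, counter=0):
    """Return the state seen and the prediction made at each branch."""
    states, predictions = [], []
    for actual in outcomes:
        states.append(NAMES[counter])
        predictions.append("T" if counter >= 2 else "N")
        if actual == "T":
            counter = min(counter + 1, 3)   # saturate at ST
        else:
            counter = max(counter - 1, 0)   # saturate at SN
    return states, predictions, counter


loop = list("TTTTN")
states1, preds1, counter = run_two_bit(loop, 0)         # first run starts in SN
states2, preds2, counter = run_two_bit(loop, counter)   # second run
```

The first run mispredicts three times (iterations 0, 1 and 4) while training; the second run mispredicts only the loop exit, compared with two mispredictions per run for the 1-bit predictor.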

Q#4 (Global / Local) predictors make use of the PC, while (global / local) predictors do not.

A#4 Local (also known as per-address) predictors make use of the PC to distinguish between different branch instructions. Global predictors do not.

Correlating Branches

(2,2) predictor
– Behavior of recent branches selects between four predictions of next branch, updating just that prediction

[Figure labels: Branch address (4), 2-bits per branch predictor, Prediction, 2-bit global branch history.]

CS252 UC Berkeley David A. Patterson

• Q#5 Two-Level Prediction:
• Given the following branch history / pattern history predictor:
– 2-bit global branch history register (Shift-Left)
– 3 bits of the PC used to access the pattern history table.
– All predictors are 2-bit predictors.
– Instruction width = 32 bits
– Assume the next branch instruction is at PC = 8004, and it will be taken eventually.
• On the following page:
– Provide the bits of the PC used by the predictor.
– Indicate if the prediction is taken/not taken.
– Show any changes to the branch history register and pattern history table after the branch taken outcome info is provided.

BHR: 0 1
PC bits used: A__ - A__

Pattern History Table (rows: PC bits 000 to 111; columns: BHR 00 to 11)

       00  01  10  11
000    00  10  11  10
001    11  10  01  01
010    01  01  01  11
011    00  01  00  10
100    00  10  11  10
101    11  10  01  01
110    01  01  01  11
111    00  01  00  10

A#5: PC bits used: A4 - A2 = 001

8004H => 00110 => Predict T (Taken)

This branch is eventually taken, as predicted. Hence:
• Branch History Register shifts left from 01 to 11 (shift in a 1).
• Pattern changes from state 10 to state 11 (refer to the 2-bit predictor state diagram).
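A#5 can be traced with a short Python sketch. It rests on an assumed reading of the slide's table: the eight rows are selected by PC bits A4 to A2, the four columns by the 2-bit BHR, and each entry is a 2-bit saturating counter whose values 10 and 11 predict taken. The names `PHT` and `predict_and_update` are hypothetical.

```python
# Rows: PC bits A4..A2 (000..111); columns: BHR value (00, 01, 10, 11).
PHT = [
    [0b00, 0b10, 0b11, 0b10],
    [0b11, 0b10, 0b01, 0b01],
    [0b01, 0b01, 0b01, 0b11],
    [0b00, 0b01, 0b00, 0b10],
    [0b00, 0b10, 0b11, 0b10],
    [0b11, 0b10, 0b01, 0b01],
    [0b01, 0b01, 0b01, 0b11],
    [0b00, 0b01, 0b00, 0b10],
]


def predict_and_update(pht, bhr, pc, taken):
    """Look up the prediction, then update the counter and the history."""
    row = (pc >> 2) & 0b111               # PC bits A4..A2 (32-bit instructions)
    counter = pht[row][bhr]
    predicted_taken = counter >= 0b10     # states 10 and 11 predict taken
    pht[row][bhr] = min(counter + 1, 3) if taken else max(counter - 1, 0)
    new_bhr = ((bhr << 1) | int(taken)) & 0b11   # shift left, shift in outcome
    return row, predicted_taken, new_bhr


row, predicted_taken, bhr = predict_and_update(PHT, 0b01, 0x8004, True)
```

With PC = 8004H and BHR = 01, the sketch selects row 001, reads state 10 (predict taken), moves that entry to 11, and shifts the BHR to 11, in line with the changes listed in A#5.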

Q#6 Is the following statement true or false? Explain.

“A predictor with more bits can always achieve a better performance”

A#6: False. More bits can often just increase training time, which will reduce the accuracy for shorter loops. Also, more bits mean more hysteresis, which in turn means “refusing” to “adopt” or “change”.

Q#7 With a branch target buffer, the address of the next instruction can be predicted while the branch is in _____ (IF/ID/EX/MEM/WB) stage.

A#7: IF Stage. The branch target buffer compares the PC against the known predicted taken branches and supplies the next address. Since only the PCs are being compared, the instruction does not have to be decoded. For accurately predicted branches, this results in zero clock penalty.
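A branch target buffer lookup can be sketched in Python as a table keyed by the PC alone, which is why it can be consulted in the IF stage before the instruction is decoded. The dictionary model, the function name `next_fetch_pc`, and the entry mapping 8004H to 9000H are hypothetical; a real BTB is a small tagged cache rather than a dictionary.

```python
def next_fetch_pc(pc, btb, instr_bytes=4):
    """Choose the next fetch address during IF using only the current PC."""
    # Hit: the PC is a known predicted-taken branch, so fetch its target.
    # Miss: fall through to the next sequential instruction.
    return btb.get(pc, pc + instr_bytes)


btb = {0x8004: 0x9000}   # hypothetical entry: branch PC -> predicted target
hit = next_fetch_pc(0x8004, btb)
miss = next_fetch_pc(0x8008, btb)
```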

CMP

Q#1 Uniprocessor pipelines (with no multithreading) are constrained by ___________ level parallelism.

Q#2 Dynamic power considerations favor ____ (Uniprocessor / Parallel Processor).

A#1 Uniprocessor pipelines (with no multithreading) are constrained by instruction level parallelism (ILP).

A#2 Dynamic power considerations favor the Parallel Processor.

Q#3a Which types of processor multithreading need a context switch through the Process Control Block?
a. Software multithreading
b. Hardware multithreading

Q#3b Which has high overhead of context switching?
a. Software multithreading
b. Hardware multithreading

A#3a a. Software multithreading

A#3b a. Software multithreading

Q#4 Does Niagara have the cache coherence issue? If yes, in which level of cache?

A#4: Yes, in the L1 cache, since it’s not shared.

Q#5a Is L1 cache shared across cores?

A#5a No.

Q#5b Is L1 cache shared (used) by the different threads running on a single core?

A#5b Yes.

• Q#6 Uniprocessors place greater burden on (hardware / software) designers, while parallel processors place greater burden on (hardware / software) designers.

• A#6 Uniprocessors place greater burden on hardware designers, while parallel processors place greater burden on software designers.