A. Moshovos ©ECE1773 - Fall ‘06 ECE Toronto Instruction Level Parallel Processing Sequential...

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto Instruction Level Parallel Processing Sequential Execution Semantics Superscalar Execution – Interruptions Out-of-Order Execution How it can help – Issues: • Maintaining Sequential Semantics • Scheduling – Scoreboard Register Renaming Initially, we’ll focus on Registers, Memory later on


Instruction Level Parallel Processing
• Sequential Execution Semantics
• Superscalar Execution
  – Interruptions
• Out-of-Order Execution
  – How it can help
  – Issues:
    • Maintaining Sequential Semantics
    • Scheduling
      – Scoreboard
    • Register Renaming
• Initially, we’ll focus on Registers, Memory later on


Sequential Semantics - Review

• Instructions appear as if they executed:
  – In the order they appear in the program
  – One after the other
• Pipelining: Partial Overlap of Instructions
  – Initiate one instruction per cycle
  – Subsequent instructions overlap partially
  – Commit one instruction per cycle


Can we do better than pipelining?

sum += a[i--]

loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

[Timing diagrams against a time axis: a Pipelining panel and a Superscalar panel, each showing fetch, decode and execution of ld, add, sub and bne]


Superscalar - In-order (initial def.)

• Two or more consecutive instructions (in the original program order) can execute in parallel

• Is this much better than pipelining?
  – What if all instructions were dependent?
    • Superscalar buys us nothing
• Again key is typical program behavior
  – Some parallelism exists
  – Pipelining “drains” on dependences
  – Superscalar consumes “fill-up” time


Practicalities

• Issue mechanism
  – At decode check:
    • Dependences
    • Input operand availability
  – Check against Instructions:
    • Simultaneously Decoded
    • In-progress in the pipeline (i.e., previously issued)
  – Recall the register vector from pipelining
• Increasingly Complex with degree of superscalarity
  – 2-way, 3-way, …, n-way


Issue Rules

• Stall at decode if:
  – RAW dependence and no data available
    • Source registers against previous targets
  – WAR or WAW dependence
    • Target register against previous targets + sources
  – No resource available
• This check is done in program order
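The stall conditions above can be sketched as a small Python check. This is only an illustration under stated assumptions, not the hardware the slides describe: the `Instr` record, the `can_issue` function, the `ready` set and the `free_units` count are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: at most 2 sources and 1 target."""
    op: str
    srcs: Sequence[str]
    tgt: Optional[str] = None


def can_issue(instr: Instr, earlier: Sequence[Instr], ready: Set[str],
              free_units: int = 1) -> bool:
    """Return True if `instr` may issue, False if it must stall at decode.

    `earlier` holds the instructions ahead of `instr` in program order that
    are decoded in the same cycle or still in progress. `ready` is the set of
    registers whose new values are already available (an assumption standing
    in for the register availability vector).
    """
    if free_units < 1:                       # no resource available
        return False
    for prev in earlier:
        # RAW: a source matches an earlier target whose data is not available
        if prev.tgt is not None and prev.tgt in instr.srcs and prev.tgt not in ready:
            return False
        if instr.tgt is not None:
            if instr.tgt == prev.tgt:        # WAW: target against earlier targets
                return False
            if instr.tgt in prev.srcs:       # WAR: target against earlier sources
                return False
    return True
```

With the loop from the earlier slide, `add r3, r3, r2` stalls behind `ld r2, 10(r1)` until `r2` is in `ready`, and `sub r1, r1, 1` stalls on the WAR with the load's use of `r1`.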


Issue Mechanism

• Assume 2 source & 1 target max per instr.
  – comparators for 2-way:
    • 3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)
  – comparators for 4-way:
    • 2nd instr: 3 tgt and 2 src
    • 3rd instr: 6 tgt and 4 src
    • 4th instr: 9 tgt and 6 src
[Figure: the target and source register fields of three instructions stacked in program order, with comparisons drawn between them]
• simplifications may be possible
• resource checking not shown
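The comparator counts on this slide follow a simple pattern, which the sketch below reproduces. The function names are hypothetical, and the count covers only comparisons among simultaneously decoded instructions (checks against instructions already in the pipeline are left out).

```python
from typing import Tuple


def comparators_for_slot(k: int, srcs: int = 2, tgts: int = 1) -> Tuple[int, int]:
    """Comparators needed by the k-th instruction (1-based) of an issue group.

    Returns (target_comparators, source_comparators).
    """
    earlier = k - 1
    # WAW + WAR: the target is compared against earlier targets and sources
    tgt_cmp = tgts * earlier * (tgts + srcs)
    # RAW: each source is compared against earlier targets
    src_cmp = srcs * earlier * tgts
    return tgt_cmp, src_cmp


def total_comparators(n: int) -> int:
    """Total comparators for an n-way group under the same assumptions."""
    return sum(sum(comparators_for_slot(k)) for k in range(1, n + 1))
```

Under these assumptions the total grows as 5·n(n−1)/2, i.e., quadratically with the issue width, which is one way to read "increasingly complex with degree of superscalarity".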


Implications

• Need to multiport some structures
  – Register File
    • Multiple Reads and Writes per cycle
  – Register Availability Vector
    • Multiple Reads and Writes per cycle
      – From Decode and Commit
      – Also need to worry about WAR and WAW
• Resource tracking
  – Additional issue conditions
• Many Superscalars had additional restrictions
  – E.g., execute one integer and one floating point op
  – one branch, or one store/load


Preserving Sequential Semantics
• In principle not much different than pipelining
• Program order is preserved in the pipeline
• Some instructions proceed in parallel
  – But order is clearly defined
• Defer interrupts to commit stage (i.e., writeback)
  – Flush all subsequent instructions
    • may include instructions committing simultaneously
  – Allow all preceding instructions to commit
• Recall comparisons are done in program order
• Must have sufficient time in clock cycle to handle these


Interrupts Example

[Timing diagrams: ld, add, div and bne flowing through fetch and decode, with markers for the point where the exception is raised and the later point where it is taken]


Superscalar vs. Pipelining

• In principle they are orthogonal
  – Superscalar non-pipelined machine
  – Pipelined non-superscalar
  – Superscalar and Pipelined (common)
• Additional functionality needed by Superscalar:
  – Another bound on clock cycle
  – At some point it limits the number of pipeline stages


Superscalar vs. Superpipelining
• Superpipelining:
  – Vaguely defined as deep pipelining, i.e., lots of stages
• Superscalar issue complexity: limits super-pipelining
• How do they compare?
  – Conceptually, not by much.
[Timing diagrams: two panels of instructions passing through fetch and decode, one for superscalar and one for superpipelined execution]


Case Study: Alpha 21164


21164: Int. Pipe


21164: Memory Pipeline


21164: Floating-Point Pipe


80486 Pipeline
• Fetch
  – Load 16 bytes into prefetch buffer
• Decode 1
  – Determine instruction length and type
• Decode 2
  – Compute memory address
  – Generate immediate operands
• Execute
  – Register Read
  – ALU
  – Memory read/write
• Write-back
  – Update register file
• (source: CS740 CMU, ’97, all slides on 486)


80486 Pipeline detail
• Fetch
  – Moves 16 bytes of instruction stream into code queue
  – Not required every time
  – About 5 instructions fetched at once (avg. length 2.5 bytes)
  – Only useful if don’t branch
  – Avoids need for separate instruction cache
• D1
  – Determine total instruction length
  – Signals code queue aligner where next instruction begins
  – May require two cycles
    • When multiple operands must be decoded
    • About 6% of “typical” DOS program


80486 Pipeline
• D2
  – Extract memory displacements and immediate operands
  – Compute memory addresses
  – Add base register, and possibly scaled index register
  – May require two cycles
    • If index register involved, or both address & immediate operand
    • Approx. 5% of executed instructions
• EX
  – Read register operands
  – Compute ALU function
  – Read or write memory (data cache)
• WB
  – Update register result


Out-of-Order Execution

• Also known as dynamic scheduling
  – Compilers do static scheduling
• We will start by considering register only
  – Register interface helps a lot
• Later on we will expand to memory
  – Tricky: Memory interface is more powerful than registers
  – Makes it harder to figure out dependences
• In principle the same method will be used for both


Beyond Superscalar Execution

do {
  sum += a[++m];
  i--;
} while (i != 0);

loop: add r4, r4, 1
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

[Timing diagrams: a Superscalar panel and an out-of-order panel, each showing fetch, decode and execution of add, ld, add, sub and bne]


Sequential Semantics?

[Timing diagram: out-of-order execution of add, ld, add, sub and bne, with points in time labeled as inconsistent or consistent machine state]

• Execution does NOT adhere to sequential semantics
• To be precise: Eventually it may
• Simplest solution: Define problem away
  – Imprecise interrupts
    • On interrupt some instr. committed, some not
    • Software will have to figure out what is going on
    • Horrible for debugging and programming


Interrupt

• Recall we use the term interrupt to signify the need to observe the machine’s state after any instruction
  – This can indeed be the result of an interrupt in the classical sense
  – But, it could be a debugger (still uses interrupts)


Out-of-Order vs. Pipelining and Superscalar

• Definition: two or more instructions can execute in any order if they have no dependences (RAW, WAW, WAR)
  – beware of transitive dependences
• Is this better than pipelining or superscalar exec?
  – If all are independent: not
  – if all dependent: not
  – Programs have some parallelism
  – Pipelining “drain” and “fill-up” overheads
  – Superscalar, parallelism only when adjacent
  – OoO exploits par. even when not adjacent

• OoO Orthogonal to pipelining and Superscalar


Out-of-order Execution Issues

• Preserving Sequential Semantics

• Stalling Instructions w/ dependences

• Issuing Instructions when dependences are satisfied


Back to Sequential Semantics

• Instr. exec. in 3 phases:
  – In-progress, Completed (NEW), Committed
  – OOO for in-progress and Completed
  – In-order Commits
• Completed - out-of-order: “Visible only inside”
  – Results visible to subsequent instructions
  – Results not visible to outsiders
    • On interrupts completed results are discarded
• Committed - in-order: “Visible to all”
  – Results visible to subsequent instructions
  – Results visible to outsiders
    • On interrupt committed results are preserved


How Completes Help w/ Performance

DIV R3, _, _
ADD R1, _, _
ADD _, R1, _

[Timing diagrams against a time axis: in-order completes with in-order commits, compared with out-of-order completes with in-order commits]
[Timing diagram: add, ld, add, sub and bne pass through fetch and decode, complete, and then commit one after the other]


Implementing Completes/Commits

• Key idea:
  – Maintain sufficient state around to be able to roll-back when necessary
  – Roll-back:
    • Discard (aka Squash) all not committed
• One solution (conceptual):
  – Upon Complete instruction records previous value of target register
  – Upon Discard, instruction restores target value
  – Upon Commit, nothing to do
• We will return to this shortly
• Focus on scheduling mechanisms
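The conceptual solution above (record the previous value on complete, restore it on discard) can be sketched as a small undo log. The `HistoryBuffer` class and its method names are assumptions made for this illustration; the sketch also assumes, as these slides do, that WAW hazards stall, so writes to one register complete in program order.

```python
class HistoryBuffer:
    """Hypothetical undo log for completed-but-not-committed register writes."""

    def __init__(self, regs):
        self.regs = dict(regs)   # register name -> current value
        self.log = []            # (seq, reg, old_value) in completion order

    def complete(self, seq, reg, value):
        """Instruction `seq` completes: save the old value, then write."""
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def commit(self, seq):
        """Commit: the register already holds the result; only free the log entry."""
        self.log = [entry for entry in self.log if entry[0] != seq]

    def squash(self):
        """Discard all not-committed results by undoing them, newest first."""
        while self.log:
            _, reg, old = self.log.pop()
            self.regs[reg] = old
```

For example, if instructions 1 and 2 complete, 1 commits, and an interrupt then squashes the rest, only the write of instruction 2 is undone.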


Out-of-Order Execution: the Big Picture
• Program Form: static program, dynamic inst. stream (trace), execution window, completed instructions
• Processing Phase: dispatch/dependences, inst. issue, inst. execution, inst. reorder & commit


Out-of-Order Execution: Stages

• Fetch: get instruction from memory
• Decode/Dispatch: what is it? What are the dependences?
• Issue: Go – all dependences satisfied
• Execute: perform operation
• Complete: result available to other insts.
• Commit: result available to outsiders

• We’ll start w/ Decode/Dispatch
• Then we’ll consider Issue


OOO Scheduling
• Instruction @ Decode:
  – Do I have dependences yet to be satisfied?
  – Yes, stall until they are
  – No, clear to issue
• Wakeup Instructions Stalled:
  – Dependences satisfied
  – Allow instruction to issue
• Dependence:
  – (later instruction, earlier instruction) & type

• We’ll first consider RAW and then move on to WAW and WAR


Stalling @ Decode for RAW

• Are there unsatisfied dependences?
  – RAW: have to wait for register value
  – We don’t really care who is producing the value
  – Only whether it is available
• Can use the Register Availability Vector as in pipelining/superscalar
  – Also known as scoreboard
• At Decode
  – Reset bit corresponding to your target
  – At writeback set
  – Check all bits for source regs: if any is 0 stall
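The availability-vector rules above can be sketched in a few lines. The class and method names are hypothetical, and one detail is an assumption of this sketch: the source bits are checked before the target bit is reset, so that an instruction such as `add r4, r4, 4` does not stall on its own target.

```python
class RegisterAvailabilityVector:
    """Hypothetical model of the RAV: 1 = value available, 0 = write pending."""

    def __init__(self, names):
        self.bit = {r: 1 for r in names}

    def try_decode(self, srcs, tgt=None):
        """Return True if the instruction passes decode, False if it stalls on RAW."""
        if any(self.bit[r] == 0 for r in srcs):   # some source not yet produced
            return False
        if tgt is not None:
            self.bit[tgt] = 0                     # reset the bit of our target
        return True

    def writeback(self, tgt):
        """The producer writes its result: set the bit again."""
        self.bit[tgt] = 1
```

This models only RAW stalls; WAW and WAR need extra checks, and cycle-level timing is not represented.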


Issuing Instructions: Scheduling
• Determine when an instruction can issue
  – Ignore resources for the time being
• Stalled because of RAW w/ preceding instruction
• Concept:
  – Producer (write) notifies consumers (read)
• Requirements:
  – Consumers need to be able to identify producer
  – The register name is one possible link
• Mechanism
  – Consumer placed in a reservation station
  – Producers on complete broadcast identity
  – Waiting instructions observe
  – Update Operand Availability
  – Issue if all operands now available


Reservation Station

• State pertaining to an instruction
  – What registers it reads
  – Whether they are available
  – What is the destination register
  – What state is the instruction in
    • Waiting
    • Executing
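The producer-notifies-consumer mechanism can be sketched as follows. This is an illustration only: the `ReservationStation` fields mirror the state listed above, the producer's identity is taken to be the target register name (one of the possible links mentioned earlier), and the extra `"Ready"` status and the `broadcast` helper are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReservationStation:
    op: str
    srcs: Dict[str, bool]        # source register -> value available?
    tgt: Optional[str]
    status: str = "Waiting"

    def __post_init__(self):
        self._update()

    def _update(self):
        # A waiting instruction becomes ready once all operands are available
        if self.status == "Waiting" and all(self.srcs.values()):
            self.status = "Ready"

    def observe(self, produced_reg: str):
        """Observe a broadcast and update operand availability."""
        if produced_reg in self.srcs:
            self.srcs[produced_reg] = True
        self._update()


def broadcast(stations: List[ReservationStation], produced_reg: str) -> List[ReservationStation]:
    """Producer completes and broadcasts its identity; return the stations woken up."""
    woken = []
    for rs in stations:
        was_waiting = rs.status == "Waiting"
        rs.observe(produced_reg)
        if was_waiting and rs.status == "Ready":
            woken.append(rs)
    return woken
```

A station created with all operands available starts out `"Ready"`; one waiting on `r2` wakes up when `broadcast(stations, "r2")` is called.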


Out-Of-Order Exec. Example

loop: add r4, r4, 4
      ld  r2, 10(r4)      (4 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

Cycle 0

RAV:  r1  r2  r3  r4
      1   1   1   1

op   src1  src2  tgt  status
(empty)


Out-Of-Order Exec. Example: Cycle 0

loop: add r4, r4, 4
      ld  r2, 10(r4)      (5 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   1   1   0

op   src1  src2  tgt   status
add  r4/1  NA/1  r4/0  Rdy

Notes: Ready to be executed


Cycle 1

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   0   1   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Exec
ld   r4/1  NA/1  r2   Rdy

Notes: R4 gets produced now; notify those waiting for R4


Cycle 2

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec
add  r3/1  r2/0  r3   Wait

Notes: Wait for r2; result available @ cycle 6


Cycle 3

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      0   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec
add  r3/1  r2/0  r3   Wait
sub  r1/1  NA/1  r1   Rdy

Notes: Wait for r2; result available @ cycle 6; sub has no dependences


Cycle 4

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec
add  r3/1  r2/0  r3   Wait
sub  r1/1  NA/1  r1   Exec
bne  r1/1  r0/1  NA   Rdy

Notes: Wait for r2; result available @ cycle 6; r1 produced now, notify consumers; r1 will be available next cycle


Cycle 5

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec
add  r3/1  r2/0  r3   Wait
sub  r1/1  NA/1  r1   Compl
bne  r1/1  r0/1  NA   Exec

Notes: Wait for r2; result available @ cycle 6; sub completed; bne executing


Cycle 6

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   1   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec
add  r3/1  r2/1  r3   Rdy
sub  r1/1  NA/1  r1   Compl
bne  r1/1  r0/1  NA   Exec

Notes: Wait for r2; result available @ cycle 6, notify consumers; sub completed; bne executing


Cycle 7

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   1   1   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Cmtd
add  r3/1  r2/1  r3   Exec
sub  r1/1  NA/1  r1   Compl
bne  r1/1  r0/1  NA   Compl

Notes (callouts on the slide): Executing; Notify consumers; Completed


Cycle 8

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
      1   1   1   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Cmtd
add  r3/1  r2/1  r3   Cmtd
sub  r1/1  NA/1  r1   Cmtd
bne  r1/1  r0/1  NA   Cmtd


Notifying Consumers

• Identity of Producer
• Uniquely Identify the Instruction
• Easily retrievable @ decode by others
  – Target Register
    • Recall we stall on WAR or WAW
  – Functional Unit
    • If not pipelined
  – Place in instruction window
  – PC? not. Why?


Name Dependences and OOO

• WAW or WAR: We need to update register but others are still using it
  – add r1, r1, 10
  – sw r1, 20(r2)
  – add r1, r3, 30
  – sub r2, r1, 40
• There is only one r1
  – sw needs to see the value of 1st add
  – sub needs to wait for 2nd add and not 1st

• Solution: Stall decode when WAW or WAR
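The dependences in the four-instruction example can be checked mechanically. The sketch below is a hypothetical helper, not part of the slides: each instruction is written as a `(target, sources)` pair, and `classify` compares register names pairwise only, so it does not account for intervening writers of the same register.

```python
def classify(earlier, later):
    """Return the dependence types of `later` on `earlier`.

    Each instruction is a pair (target_or_None, list_of_sources).
    """
    e_tgt, e_srcs = earlier
    l_tgt, l_srcs = later
    deps = set()
    if e_tgt is not None and e_tgt in l_srcs:
        deps.add("RAW")      # later reads what earlier writes
    if l_tgt is not None and l_tgt in e_srcs:
        deps.add("WAR")      # later overwrites what earlier reads
    if l_tgt is not None and l_tgt == e_tgt:
        deps.add("WAW")      # both write the same register
    return deps


# The example from the slide, encoded by hand
add1 = ("r1", ["r1"])        # add r1, r1, 10
sw = (None, ["r1", "r2"])    # sw  r1, 20(r2)
add2 = ("r1", ["r3"])        # add r1, r3, 30
sub = ("r2", ["r1"])         # sub r2, r1, 40
```

Here `classify(sw, add2)` reports a WAR on `r1` and `classify(add1, add2)` reports both WAR and WAW, which is why the second add cannot simply overwrite `r1` early.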


Detecting WAW and WAR

• WAW? Look at Scoreboard
  – If bit is 0 then there is a pending write
  – Stall
• WAR? Need to know whether all preceding consumers have read the value
  – Keep a count per register
  – Increase at decode for all reads
  – Decrease on issue
• More elegant solution via register renaming
  – Soon
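The two checks above can be combined in one small tracker. This is a minimal sketch with hypothetical names (`NameHazardTracker`, `decode`, `issue`, `writeback`); it models only the WAW and WAR stalls and leaves RAW waiting to the availability vector discussed earlier.

```python
class NameHazardTracker:
    """Hypothetical tracker: scoreboard bit for WAW, reader count for WAR."""

    def __init__(self, names):
        self.available = {r: 1 for r in names}      # 0 = pending write
        self.pending_reads = {r: 0 for r in names}  # decoded but not yet issued readers

    def must_stall(self, tgt):
        waw = self.available[tgt] == 0              # an earlier write is pending
        war = self.pending_reads[tgt] > 0           # earlier readers have not read yet
        return waw or war

    def decode(self, srcs, tgt=None):
        """Return False if the instruction must stall at decode (WAW or WAR)."""
        if tgt is not None and self.must_stall(tgt):
            return False
        for r in srcs:
            self.pending_reads[r] += 1              # increase at decode for all reads
        if tgt is not None:
            self.available[tgt] = 0
        return True

    def issue(self, srcs):
        for r in srcs:
            self.pending_reads[r] -= 1              # decrease on issue

    def writeback(self, tgt):
        self.available[tgt] = 1
```

Applied to the earlier example, the second `add r1, r3, 30` stalls first on the WAW with the first add, and then on the WAR until `sw` has issued and read `r1`.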


Window vs. Scheduler
• Window
  – Distance between oldest and youngest instruction that can co-exist inside the CPU
  – Larger window: potential for more ILP
• Scheduler
  – Number of instructions that are waiting to be issued
• Window
  – Instructions enter at Fetch
  – Exit at Commit
• Scheduler
  – Instructions enter at Decode
  – Leave at writeback
• Window >= Scheduler
  – Can be the same structure
• In window but not in scheduler: completed instructions


Scoreboarding
• Schedule based on RAW dependences
• WAW and WAR cause stalls
  – WAW at decode
  – WAR at writeback
    • Optimization: Why is this OK?
• Implemented in the CDC 6600 in ‘64
  – 18 non-pipelined FUs
    • 4 FP: 2 mul, 1 add, 1 div
    • 7 MEM: 5 load, 2 store
    • 7 INT: add, shift, logical etc.
• Centralized Control Scheme
  – Controls all Instruction Issue
  – Detects all hazards


MIPS/DLX w/ Scoreboarding

[Block diagram: a register file connected to the functional units (FP mul, FP mul, FP divide, FP add, integer), with the scoreboard controlling them]


Scoreboarding Overview
• Ignore IF and MEM for simplicity
• 4-stage execution
  – Issue
    • Check for structural hazards
    • Check for WAW hazards
    • Stall until all clear
  – ReadOp
    • Check for RAW hazards
    • Wait until all operands ready
    • Read Registers
  – Execute
    • Execute Operations
    • Notify scoreboard when complete
  – Write
    • Check for WAR hazards
    • Stall Write until all clear
    • A completing instruction cannot write dest if an earlier instruction has not read dest.


Scoreboarding Optimizations/Tricks

• WAW as in original OOO
• WAR is optimized
  – Second Producer is allowed to execute up to complete
  – It is stalled there until preceding consumers complete
• No Commit
  – No precise interrupts
• Window is implemented in the scoreboard
• One entry per Functional Unit
  – Recall not pipelined
  – Instructions identified by FU id


Scoreboarding Organization
• Three structures
  – Instruction Status
  – Functional Unit Status
  – Register Result Status
• Instruction Status
  – Which stage the instruction is currently in
• Functional Unit Status: scheduling
  – Busy
  – OP: Operation
  – Fi: Dest. Reg.
  – Fj, Fk: Source Regs
  – Qj, Qk: FUs producing sources
  – Rj, Rk: Ready bits for sources
• Register Result Status: dep. determination
  – Which FU will produce a register


Scoreboarding explained

• Register status reg:
  – Which FU produces the register
• Use at decode
  – Source reg match is a RAW
  – Target reg match is a WAW stall


Functional Unit Status
• Busy:
  – resource allocation
• OP:
  – what to do once issued (e.g., add, sub)
• Dest. Reg.:
  – Where to write result
  – To find WAR
• Fj, Fk Source Regs
  – for WAR: can’t write if consumers pending for previous value of register (if FU not the same)
• Qj, Qk FUs producing sources
  – To wait for appropriate producer
• Rj, Rk Ready bits for sources
  – To determine when ready: all ready
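The fields above can be written down as a record, together with the two checks they support. This is a sketch under an assumption that the slides do not spell out: following the usual textbook convention, a ready bit (Rj, Rk) is cleared once the operand has actually been read, so a set bit on another unit means its read is still pending. The `FUStatus` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None     # destination register
    fj: Optional[str] = None     # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None     # FUs producing the sources (None = no pending producer)
    qk: Optional[str] = None
    rj: bool = False             # ready bits for the sources
    rk: bool = False


def can_read_operands(fu: FUStatus) -> bool:
    """ReadOp stage: proceed only when all sources are ready."""
    return fu.busy and fu.rj and fu.rk


def can_write_result(name: str, units: Dict[str, FUStatus]) -> bool:
    """Write stage (WAR check): no other busy FU may still need the old value of Fi."""
    dest = units[name].fi
    for other_name, other in units.items():
        if other_name == name or not other.busy:
            continue
        if (other.fj == dest and other.rj) or (other.fk == dest and other.rk):
            return False
    return True
```

For instance, if the Divide unit holds `DIVD F10, F0, F6` with F6 ready but not yet read, the Add unit holding `ADDD F6, F8, F2` may finish executing but `can_write_result("Add", units)` stays false until the divide has read F6.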


Scoreboarding Example

Instruction status
Instruction  dest  j    k    Issue  Read operands  Execution complete  Write Result
LD           F6    34+  R2
LD           F2    45+  R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Functional Unit Status
                    dest  S1  S2  FU for j  FU for k  Fj?  Fk?
Name     Busy  Op   Fi    Fj  Fk  Qj        Qk        Rj   Rk
Integer  No
Mult1    No
Mult2    No
Add      No
Divide   No

Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
FU


Example: Cycle 0

Instruction status
Instruction  dest  j    k    Issue  Read operands  Execution complete  Write Result
LD           F6    34+  R2   1
LD           F2    45+  R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Functional Unit Status
                    dest  S1  S2  FU for j  FU for k  Fj?  Fk?
Name     Busy  Op   Fi    Fj  Fk  Qj        Qk        Rj   Rk
Integer  yes   LD   F6
Mult1    No
Mult2    No
Add      No
Divide   No

Register result status
Clock  F0  F2  F4  F6       F8  F10  F12  ...  F30
FU                 integer


Example, contd.

• The rest you’ll find on the web site
• Go through it
• Source: Patterson

• Summary:
  – Execution proceeds in an order dictated by dependences
  – RAW, WAR and WAW force ordering
  – Tricks may be possible


Beyond Simple OoO

[Dependence graph over the instructions A, B, C, D and E]

A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4

• E will wait for B, C and D.
  – WAR w/ C and D
  – WAW w/ B
• Can we do better?


What if we had infinite registers

Original:
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4

With E writing a fresh register:
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F9, F7, F4

• No false dependences anymore
• Since we do not reuse a name we can’t have WAW and WAR


Why we can’t have Infinite Registers

• False/Name dependences (WAR and WAW)
  – Artifact of having finite registers
• There is no such thing as infinite
• There is no such thing as large enough
  – Well there is (in a sec.)
  – Computers execute Billions of Instructions per sec. Even a multi-billion register file would soon be exhausted
• Want to exploit parallelism across several instances of the same code
  – Loops, recursive functions (most frequent part)


Yes, there is “large enough”

• At any given point there will be a finite number of instructions in the window
• If each instruction has a single register target
• If there are N instructions
• How many registers do we need?
  – N?
  – N + X?


Register Renaming

• Register Version
  – Every Write creates a new version
  – Uses read the last version
  – Need to keep a version until all uses have read it.
• Register Renaming:
  – Architectural vs. Physical Registers
    • more phys. than arch.
  – Maintain a map of arch. to phys. regs.
  – Use in-order decoding to properly identify dependences.
  – Instructions wait only for input op. availability.
  – Only last version is written to reg. file.


Register Renaming

Instruction            Renamed (dest, src1, src2)
A: DIVF F3, F1, F0     r1, -, -
B: SUBF F2, F1, F0     r2, -, -
C: MULF F0, F2, F4     r3, r2, -
D: SUBF F6, F2, F3     r4, r2, r1
E: ADDF F2, F5, F4     r5, -, -
F: ADDF F0, F0, F2     r6, r3, r5

Register Rename Table (contents after each instruction is renamed)

     F0  F1  F2  F3  F4  F5  F6  F7  ...  F30
A                R1
B            R2  R1
C    R3      R2  R1
D    R3      R2  R1          R4
E    R3      R5  R1          R4
F    R6      R5  R1          R4

• Need more physical registers than architectural
• Ignore control flow for the time being.
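
The table above can be reproduced with a few lines of code. This is a minimal sketch under simplifying assumptions that are not in the slides: physical registers are handed out from an unbounded counter (no free list), and a source with no in-flight producer is shown as "-", as in the slide. The function name `rename` is hypothetical.

```python
def rename(program):
    """Rename (dest, src1, src2) triples in program order."""
    rename_table = {}   # architectural register -> latest physical register
    next_phys = 1       # assumption: unbounded supply of physical registers
    renamed = []
    for dest, src1, src2 in program:
        # Sources read the most recent version; "-" means no in-flight producer.
        p1 = rename_table.get(src1, "-")
        p2 = rename_table.get(src2, "-")
        # Every write creates a new version of the destination.
        phys = f"r{next_phys}"
        next_phys += 1
        rename_table[dest] = phys
        renamed.append((phys, p1, p2))
    return renamed, rename_table


program = [
    ("F3", "F1", "F0"),  # A: DIVF
    ("F2", "F1", "F0"),  # B: SUBF
    ("F0", "F2", "F4"),  # C: MULF
    ("F6", "F2", "F3"),  # D: SUBF
    ("F2", "F5", "F4"),  # E: ADDF
    ("F0", "F0", "F2"),  # F: ADDF
]

renamed, table = rename(program)
for label, row in zip("ABCDEF", renamed):
    print(label, row)
```

Note that sources are looked up before the destination is remapped, which is what makes F (`ADDF F0, F0, F2`) read r3 for its F0 source while writing r6.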


Register Renaming Process

• Only need to remember last producer of each architectural register
  – Vector
• At decode
  – Find the most recent producers for all source registers
  – After: declare self as most recent producer of target register
• Complication:
  – May have to retract
    • Speculative Execution, e.g., interrupts
  – Need to be able to restore the mapping state


Register Renaming Support Structures

• Register Rename Table
  – f(aR) = pR
  – One entry per architectural register
• Free Register List
  – Lists the physical registers that are not in use
• At Decode
  – Grab a new register from the free list
  – Change mapping in rename table
• At Commit
  – Release Register? Not… Why?
  – Could release previous version
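
The two structures and the decode/commit actions above might be sketched as follows. This is an illustrative sketch, not the slides' design: the class name `Renamer`, its method names and the initial identity mapping are assumptions. It follows the "release previous version" option: the register an instruction just wrote still holds the current architectural value at commit, so what goes back to the free list is the register that was mapped to the same architectural name before it (assuming in-order commit, so all older readers are done).

```python
from collections import deque


class Renamer:
    def __init__(self, arch_regs, num_phys):
        assert num_phys > len(arch_regs), "need more physical than architectural"
        # Rename table f(aR) = pR, one entry per architectural register.
        self.table = {a: i for i, a in enumerate(arch_regs)}
        # Free list: physical registers not currently mapped.
        self.free = deque(range(len(arch_regs), num_phys))

    def decode(self, dest, srcs):
        """Rename one instruction; return None to signal a stall."""
        phys_srcs = [self.table[s] for s in srcs]   # read mappings first
        if not self.free:
            return None                             # no free register: stall
        new = self.free.popleft()
        prev = self.table[dest]                     # previous version of dest
        self.table[dest] = new
        return new, phys_srcs, prev

    def commit(self, prev):
        """At commit, release the previous version of the destination."""
        self.free.append(prev)


r = Renamer(["F0", "F2"], num_phys=4)
print(r.decode("F2", ["F0"]))   # new version of F2
print(r.decode("F2", ["F2"]))   # reads the version created just above
print(r.decode("F0", []))       # free list is empty here, so this stalls
```

Once the first instruction commits and returns its `prev` register through `commit`, the stalled decode can be retried and succeeds.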


How Many Physical Registers?

• Correctness:
  – At least as many as architectural, plus?
• Performance:
  – As many as possible
  – Not correctness
  – Recall not all instructions produce register results
    • stores and branches


Dynamic Scheduling

A: DIVF F3, F1, F0 r1, -, -

B: SUBF F2, F1, F0 r2, -, -

C: MULF F0, F2, F4 r3, r2, -

D: SUBF F6, F2, F3 r4, r2, r1

E: ADDF F2, F5, F4 r5, -, -

F: ADDF F0, F0, F2 r6, r3, r5

[Figure: Name and Value fields]

- Values and Names flow together
- Writeback specifies both value and name
- A waiting instruction inspects all results
- It is allowed to execute when all inputs are available
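
The wakeup rule above can be sketched in a few lines. This is a minimal illustration, not the slides' hardware: the `Waiting` class and the `broadcast` helper are hypothetical names, the result values are arbitrary, and a "-" source is assumed to be already available.

```python
class Waiting:
    """An instruction waiting in the window for its renamed sources."""

    def __init__(self, label, dest, srcs):
        self.label = label
        self.dest = dest
        self.pending = {s for s in srcs if s != "-"}  # names still outstanding
        self.values = {}                              # captured (name -> value)

    def snoop(self, name, value):
        # Every waiting instruction inspects every result.
        if name in self.pending:
            self.pending.discard(name)
            self.values[name] = value

    @property
    def ready(self):
        return not self.pending


def broadcast(window, name, value):
    """Writeback of (name, value); return labels that just became ready."""
    woken = []
    for w in window:
        was_ready = w.ready
        w.snoop(name, value)
        if w.ready and not was_ready:
            woken.append(w.label)
    return woken


# C, D and F from the slide, with their renamed operands.
window = [
    Waiting("C", "r3", ("r2", "-")),
    Waiting("D", "r4", ("r2", "r1")),
    Waiting("F", "r6", ("r3", "r5")),
]

print(broadcast(window, "r2", 1.0))   # B writes back
print(broadcast(window, "r5", 2.0))   # E writes back
print(broadcast(window, "r3", 3.0))   # C writes back
print(broadcast(window, "r1", 4.0))   # A (the long divide) writes back last
```

In this ordering D, which depends on the divide, becomes ready last, while F is free to run as soon as C and E have written back.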


Physical Registers

• Physical register file is just one option
• What we need is separate storage
  – Consumers could keep values in their reservation station
  – Tomasulo’s next


Tomasulo’s Algorithm

• IBM 360/91 - Fast 360 for scientific code
  – Completed in 1967
  – Dynamic scheduling
  – Predates cache memories
• Pipelined FUs
  – Adder up to 3 instructions
  – Multiplier up to 2 instructions
• Tomasulo vs. Scoreboard
  – Distributed hazard detection and control
  – Results are bypassed to FUs
  – Common Data Bus (CDB) for results
    • All results visible to all instead of via a register


DLX w/ Tomasulo

• Tomasulo’s Algorithm
  – Use “tags” to identify data values
  – Reservation stations: distributed control
  – CDB broadcasts all results to all RSs
• Extend DLX as example
  – Assume multiple FUs rather than pipelined
  – Main difference is Register-Memory Insts.
    • I.e., DLX does not have them
    • But that’s really a detail :-)
• Physical Registers?
  – Not really. What we need is different storage and name for every version.
  – Here it’s the producing reservation station


Dynamic DLX

[Figure: Dynamic DLX block diagram. Labels: Operation Stack, Registers, Load buffers, Store buffers, RS (two sets), adders, Mults, CDB]


Tomasulo’s Algorithm

• 3 major steps
  – Dispatch
    • Get instruction from fetch queue
    • ALU op: check for available RS
    • Load: check for available load buffer
    • If available: dispatch and copy read regs to RS or load buffer
    • If not: stall - structural hazard
  – Issue
    • If all ops are available: issue
    • If not: monitor CDB for operands
  – Complete
    • If CDB available, broadcast result
    • Else stall


Tomasulo’s Algorithm contd.

• Reservation stations
  – Handle distributed hazard detection and instruction control
• Everything receiving data gets its tag
  – 4-bit tag specifies reservation station or load buffer
  – Also which FU will produce result
• Register specifier is used to assign tags
  – Then they are discarded
  – Input register specifiers are ONLY used in dispatch. (Rename table)
• Common Data Bus:
  – value + “tag” = where this comes from
  – vs. typical bus: value + “tag” = where this goes to


Tomasulo’s Algorithm Contd.

• Reservation Stations
  – Op: Opcode
  – Qj, Qk: Tag fields (source ops)
  – Vj, Vk: Operand values (source ops)
  – Busy: Currently in use
• Register file and Store Buffer
  – Qi: Tag field
  – Busy: Currently in use
  – Vi: Value
• Load Buffers
  – Busy: Currently in use


Tomasulo’s: Understanding Speculative vs. Architectural State

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file

Arch. Reg. Name  Where is the register?  Value
r1               I have it               Value of r1
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               I have it               Value of r4

“Where is the register?” can be: “I have it”, “reservation station id”

Reservation Stations

tgt (Reg Arch. name)  src1 where  src1 value     src2 where  src2 value
NA                    NA          Value of Src1  NA          Value of Src2
NA                    NA          Value of Src1  NA          Value of Src2


Renaming 1st Instruction

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file

Arch. Reg. Name  Where is the register?  Value
r1               RS0                     -----
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               I have it               Value of r4

Reservation Stations

RS   tgt  src1 where  src1 value     src2 where  src2 value
RS0  r1   I have it   Value of r2    I have it   10
     NA   NA          Value of Src1  NA          Value of Src2
     NA   NA          Value of Src1  NA          Value of Src2

• Read sources (r2)
• Rename r1 to RS0


Renaming 2nd Instruction

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file

Arch. Reg. Name  Where is the register?  Value
r1               RS0                     -----
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               RS1                     ----

Reservation Stations

RS   tgt  src1 where  src1 value     src2 where  src2 value
RS0  r1   I have it   Value of r2    I have it   10
RS1  r4   RS0         ----           I have it   20
     NA   NA          Value of Src1  NA          Value of Src2

• Sources: r1 in RS0, NYA (not yet available)
• Rename r4 to RS1


Renaming 3rd Instruction

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file

Arch. Reg. Name  Where is the register?  Value
r1               RS2                     -----
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               RS1                     ----

Reservation Stations

RS   tgt  src1 where  src1 value     src2 where  src2 value
RS0  r1   I have it   Value of r2    I have it   10
RS1  r4   RS0         ----           I have it   20
RS2  r1   I have it   Value of r3    I have it   30

• Sources: r3 Avail.
• Rename r1 to RS2
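
The three renaming steps above can be replayed in code. What follows is a simplified sketch and not the 360/91 design: the names `RegEntry`, `Station`, `dispatch` and `broadcast` are hypothetical, a tag of `None` stands for "I have it", integer sources are treated as immediates, and the initial register values (r1 = 1 ... r4 = 4) are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegEntry:
    value: Optional[int] = None
    tag: Optional[str] = None    # None = "I have it", else producing RS id


@dataclass
class Station:
    name: str
    busy: bool = False
    op: str = ""
    vj: Optional[int] = None     # operand values
    vk: Optional[int] = None
    qj: Optional[str] = None     # tags of pending operands
    qk: Optional[str] = None


def read_operand(regs, src):
    """Return (value, tag) for an immediate or an architectural register."""
    if isinstance(src, int):
        return src, None
    entry = regs[src]
    return (entry.value, None) if entry.tag is None else (None, entry.tag)


def dispatch(regs, stations, op, dest, src1, src2):
    rs = next((s for s in stations if not s.busy), None)
    if rs is None:
        return None                  # structural hazard: stall
    rs.busy, rs.op = True, op
    rs.vj, rs.qj = read_operand(regs, src1)
    rs.vk, rs.qk = read_operand(regs, src2)
    regs[dest].tag = rs.name         # rename dest to this station
    return rs.name


def broadcast(regs, stations, tag, value):
    """CDB: value + tag of the producer; everyone waiting on the tag captures it."""
    for s in stations:
        if s.busy and s.qj == tag:
            s.vj, s.qj = value, None
        if s.busy and s.qk == tag:
            s.vk, s.qk = value, None
        if s.name == tag:
            s.busy = False           # producer's station is released
    for r in regs.values():
        if r.tag == tag:             # only the latest producer updates a register
            r.value, r.tag = value, None


regs = {f"r{i}": RegEntry(value=i) for i in range(1, 5)}
stations = [Station(f"RS{i}") for i in range(3)]
dispatch(regs, stations, "add", "r1", "r2", 10)
dispatch(regs, stations, "sub", "r4", "r1", 20)
dispatch(regs, stations, "add", "r1", "r3", 30)
print(regs["r1"].tag, regs["r4"].tag, stations[1].qj)
```

When RS0 later broadcasts its result, RS1 captures it as its first operand, but the register file entry for r1 ignores it because r1 has since been renamed to RS2. That is how the WAW between the two adds is handled without stalling.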


Example: cycle 0

Instruction status

Instruction       j    k    Issue  Execution complete  Write Result
LD     F6         34+  R2
LD     F2         45+  R3
MULTD  F0         F2   F4
SUBD   F8         F6   F2
DIVD   F10        F0   F6
ADDD   F6         F8   F2

Reservation Stations

Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

(Vj = S1, Vk = S2, Qj = RS for j, Qk = RS for k)

Register result status

    F0  F2  F4  F6  F8  F10  ...
FU

Load buffers

Name   Busy  Address
Load1  No
Load2  No
Load3  No


Example: cycle 1

Instruction status

Instruction       j    k    Issue  Execution complete  Write Result
LD     F6         34+  R2   1
LD     F2         45+  R3
MULTD  F0         F2   F4
SUBD   F8         F6   F2
DIVD   F10        F0   F6
ADDD   F6         F8   F2

Reservation Stations

Time  Name  Busy  Op  Vj  Vk  Qj  Qk
0     A1    No
0     A2    No
0     A3    No
0     M1    No
0     M2    No

(Vj = S1, Vk = S2, Qj = RS for j, Qk = RS for k)

Register result status

    F0  F2  F4  F6  F8  F10  ...
FU              L1

Load buffers

Name  Busy  Address
L1    Yes   34+R2
L2    No
L3    No


Example: cycle 3

Instruction status

Instruction       j    k    Issue  Execution complete  Write Result
LD     F6         34+  R2   1      3
LD     F2         45+  R3   2
MULTD  F0         F2   F4   3
SUBD   F8         F6   F2
DIVD   F10        F0   F6
ADDD   F6         F8   F2

Reservation Stations

Time  Name  Busy  Op   Vj  Vk     Qj  Qk
0     A1    No
0     A2    No
0     A3    No
0     M1    Yes   Mul      R(F4)  L2
0     M2    No

(Vj = S1, Vk = S2, Qj = RS for j, Qk = RS for k)

Register result status

    F0  F2  F4  F6  F8  F10  ...
FU  M1  L2      L1

Load buffers

Name  Busy  Address
L1    Yes   34+R2
L2    Yes   45+R3
L3    No

- Mul is issued, unlike in the scoreboard
- What’s waiting for L1?


Example…

• Check the web site…
• Too much for in-class
• Summary:
  – Execution proceeds in any order that does not violate RAW dependences
  – WAR and WAW are removed


Tomasulo’s vs. Scoreboard

Tomasulo’s:

Instruction       j    k    Issue  Execution complete  Write Result
LD     F6         34+  R2   1      3                   4
LD     F2         45+  R3   2      4                   5
MULTD  F0         F2   F4   3      15                  16
SUBD   F8         F6   F2   4      7                   8
DIVD   F10        F0   F6   5      56                  57
ADDD   F6         F8   F2   6      10                  11

- In-order issue
- Out-of-order execution
- Out-of-order completion

Scoreboard:

Instruction       j    k    Issue  Read operands  Execution complete  Write Result
LD     F6         34+  R2   1      2              3                   4
LD     F2         45+  R3   5      6              7                   8
MULTD  F0         F2   F4   6      9              19                  20
SUBD   F8         F6   F2   7      9              11                  12
DIVD   F10        F0   F6   8      21             61                  62
ADDD   F6         F8   F2   13     14             16                  22


Tomasulo’s

• Out-of-order loads and stores?
  – What about WAW, RAW and WAR?
  – Compare all load addresses against the addresses of all preceding store buffers
  – Stall if they match
• CDB is a bottleneck
  – One write per cycle
  – Could duplicate
  – But that comes at a cost
  – Datapath + duplicated tags and control
• Complex Implementation
  – Scalability?
  – All results to all sources
  – What if we want 128 instrs?
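
The load check in the first bullet above amounts to an address comparison against every older buffered store. A minimal sketch, assuming all store addresses are already known and ignoring access sizes; the function name `load_must_stall` and the example addresses are made up:

```python
def load_must_stall(load_addr, preceding_store_addrs):
    """True if any older buffered store writes the address the load reads."""
    # A match is a RAW through memory: the load has to wait for the store.
    return any(addr == load_addr for addr in preceding_store_addrs)


store_buffer = [0x1000, 0x2040]               # hypothetical pending stores
print(load_must_stall(0x2040, store_buffer))  # same address as a pending store
print(load_must_stall(0x3000, store_buffer))  # no match, load may go ahead
```

Note that this is itself an associative lookup: each load is compared against every store buffer entry, which ties into the scalability concern in the last bullet.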


Tomasulo’s

• Advantages
  – Distribution of hazard detection
  – Elimination of WAR and WAW stalls
• Common Data Bus
  – Broadcasts result to multiple instrs (+)
  – Bottleneck
• Register Renaming
  – Removes WAR and WAW hazards
  – More interesting when same code appears twice
    • Think of loops
    • More on this later
  – BUT: Associative lookups
  – RECALL: direct map is faster


In Summary

Feature      Scoreboarding            Tomasulo's
             CDC6600                  IBM 360
Structural   Stall in Issue for FU    Stall in Dispatch for RS
                                      Stall in RS for FU
RAW          Via Registers            From CDB
WAR          Stall in WB              Copy Value to RS
WAW          Stall in Issue           Register Renaming
Logic        Centralized              Distributed
Bottlenecks  No Register Bypass       One CDB
             Stall in issue block