Advanced Pipelining

• Optimally Scheduling Code

• Optimally Programming Code

• Scheduling for Superscalars (6.9)

• Exceptions (5.6, 6.8)

Optimally schedule code

• for (i = 0; i < N; i++)
      A[i] = A[i] + 10;

• &(A[0]) in $s1

• &(A[N]) in $s2

slt $t1, $s3, $s0

beq $t1, $0, end

loop:

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

1. Identify Dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

$t0 – lw->addi – RAW

$t0 – addi->sw – RAW

2. Draw timing diagram WITH DATA FORWARDING

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

F D X M W

3. Remove WAR/WAW dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

RAW, WAR, WAW

[Timing diagram: lw, addi, sw, addi, slt, bne, each passing through stages F D X M W]

Target the false dependencies


Original              Incorrect             Correct
lw $t0, 0($s1)        lw $t0, 0($s1)        lw $t0, 0($s1)
sw $t0, 0($s1)        addi $s1, $s1, 4      addi
addi $s1, $s1, 4      sw $t0, 0($s1)        sw

lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

sw $t0, ____($s1)

slt $t1, $s1, $s2

bne $t1, $0, loop

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop


lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

slt $t1, $s1, $s2

sw $t0, -4($s1)

bne $t1, $0, loop

[Timing diagram: the six instructions of the loop, each passing through stages F D X M W]
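The reordering above can be mirrored in C as a rough sketch. The function names `add10_original` and `add10_scheduled` are hypothetical, both assume a non-empty array (matching the guard before `loop:`), and the byte offsets in the comments assume 4-byte `int`.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: load, add, store, then bump the pointer. */
void add10_original(int *a, size_t n)
{
    assert(n > 0);         /* the do/while body always runs once */
    int *p = a;            /* plays the role of $s1 */
    int *end = a + n;      /* plays the role of $s2 */
    do {
        int t = *p;        /* lw   $t0, 0($s1)   */
        t += 10;           /* addi $t0, $t0, 10  */
        *p = t;            /* sw   $t0, 0($s1)   */
        p++;               /* addi $s1, $s1, 4   */
    } while (p < end);     /* slt / bne          */
}

/* Scheduled order: the pointer bump moves between the load and its use,
   so the store has to reach back one element (offset -4). */
void add10_scheduled(int *a, size_t n)
{
    assert(n > 0);
    int *p = a;
    int *end = a + n;
    int more;
    do {
        int t = *p;        /* lw   $t0, 0($s1)   */
        p++;               /* addi $s1, $s1, 4   */
        t += 10;           /* addi $t0, $t0, 10  */
        more = p < end;    /* slt  $t1, $s1, $s2 */
        p[-1] = t;         /* sw   $t0, -4($s1)  */
    } while (more);        /* bne  $t1, $0, loop */
}
```

Both functions should leave the array in the same state; only the order of independent operations differs.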

Software Control Hazard Removal

if ((x % 2) == 1)
    isodd = 1;

if (x == true)
    y = false;
else
    y = true;

if ((x == MON) || (x == TUE) || (x == WED)) {}

if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) {}
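One way to read the first three examples is that each branch can be replaced by a calculation. The sketch below is an illustration, not the slide's own answer: the function names and the `enum day` ordering are assumptions, and whether a compiler emits branch-free code for them depends on the target and flags.

```c
#include <stdbool.h>

/* Assumed ordering: MON, TUE, WED are consecutive and come first. */
enum day { MON, TUE, WED, THU, FRI, SAT, SUN };

/* if ((x % 2) == 1) isodd = 1;   becomes: take the low bit
   (assuming isodd would otherwise be 0). */
int is_odd(unsigned x)
{
    return x & 1u;
}

/* if (x == true) y = false; else y = true;   becomes: logical negation. */
bool negate(bool x)
{
    return !x;
}

/* if ((x == MON) || (x == TUE) || (x == WED))   becomes: one range compare.
   Only valid under the enum ordering assumed above. */
bool is_mon_to_wed(enum day x)
{
    return (unsigned)x <= (unsigned)WED;
}
```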

Increasing Branch Performance

What does it all mean?

• Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

The moral is…..

• Calculation is less expensive than …..

Superscalars - Parallelism

Ford mass produces cars. We want to “mass produce” instructions

Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.

Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.

“Superpipelining” (deep pipelining – many stages)

• Limiting returns because….

• Register delays are __________________________ of clock

• Difficult to __________________

SuperScalars

• __________ parts of pipeline

• Multiple instructions in _______ stage at once

SuperScalars

• Which instructions can execute in parallel?

• Fetching multiple instructions per cycle

Static Scheduling – VLIW or EPIC (Itanium)

• __________ schedules the instructions

• If one instruction stalls, all following instructions stall

• Book Example: SuperScalar MIPS:

– Two instructions / cycle

– one alu/branch, one ld/st each cycle
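As a rough sketch of the "which instructions can execute in parallel?" question for this two-issue machine, the check below pairs two adjacent instructions. The struct layout and the name `can_issue_together` are hypothetical; the rule encoded is only the one stated on the slide (one ALU/branch slot, one ld/st slot) plus RAW and WAW checks between the pair. It assumes WAR within a pair is harmless because registers are read before they are written.

```c
#include <stdbool.h>

#define NO_REG (-1)   /* marks an unused register slot */

enum unit { UNIT_ALU_BRANCH, UNIT_LD_ST };

struct instr {
    enum unit unit;        /* which issue slot the instruction needs */
    int dest, src1, src2;  /* register numbers, or NO_REG */
};

/* True when instruction i reads register reg. */
static bool reads(const struct instr *i, int reg)
{
    return reg != NO_REG && (i->src1 == reg || i->src2 == reg);
}

/* first precedes second in program order. */
bool can_issue_together(const struct instr *first, const struct instr *second)
{
    if (first->unit == second->unit)
        return false;                      /* only one slot per unit */
    if (reads(second, first->dest))
        return false;                      /* RAW inside the pair */
    if (first->dest != NO_REG && first->dest == second->dest)
        return false;                      /* WAW inside the pair */
    return true;
}
```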

Schedule for SS MIPS

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

PC      ALU/branch      ld/st
0
8
16
24
32

SuperScalars - Static

[Pipeline diagram: Fetch, Decode, Execute, Memory, WriteBack stages, with Read Values and Write Values marked; instructions shown: lw, addu, sw, addi, bne]

Loop Problem

• Problem:

– Too many _______________ in loop

– Not enough ______________ to fill in holes

• Solution:

– Do ______________ at once

– More instructions

– Only one branch

Loop Unrolling 1. Unroll Loop

Original:

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Unrolled:

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Loop Unrolling 2. Rename Registers

But wait!!! How has this helped? There are tons of dependencies! Whatever are we to do? Register Renaming!!!


Loop Unrolling 3. Reduce Instructions

Before:

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, ___($s1)
      lw   $t1, ___($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

After:

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -8
      addu $t0, $t0, $s2
      sw   $t0, 8($s1)
      lw   $t1, 4($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

Loop Unrolling 4. Schedule

Loop: lw1   $t0, 0($s1)
      addi  $s1, $s1, -8
      addu1 $t0, $t0, $s2
      sw1   $t0, 8($s1)
      lw2   $t1, 4($s1)
      addu2 $t1, $t1, $s2
      sw2   $t1, 4($s1)
      bne   $s1, $zero, Loop

ALU/branch        lw/sw
                  lw1
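The same unroll-and-rename idea can be sketched at the C level. The function names are hypothetical, and the unrolled version assumes an even element count so that no cleanup loop is needed. The separate temporaries `t0` and `t1` stand in for the renamed registers `$t0` and `$t1`.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one element and one branch per iteration. */
void add_scalar(int *a, size_t n, int s)
{
    for (size_t i = n; i > 0; i--)
        a[i - 1] += s;
}

/* Unrolled by two: two elements per branch. The loads are grouped ahead
   of the adds and stores, which is legal because the two copies no longer
   share a temporary. */
void add_scalar_unrolled2(int *a, size_t n, int s)
{
    assert(n % 2 == 0);            /* sketch assumes an even trip count */
    for (size_t i = n; i > 0; i -= 2) {
        int t0 = a[i - 1];         /* lw1   */
        int t1 = a[i - 2];         /* lw2   */
        t0 += s;                   /* addu1 */
        t1 += s;                   /* addu2 */
        a[i - 1] = t0;             /* sw1   */
        a[i - 2] = t1;             /* sw2   */
    }
}
```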

Performance Comparison

Original

ALU/branch              ld/st
                        lw $t0, 0($s1)
addi $s1, $s1, -4
addu $t0, $t0, $s2
bne $s1, $zero, L       sw $t0, 4($s1)

Unrolled

Static Scheduling Summary

• Code size ______________ (because of nops)

• It cannot resolve __________ dependencies

• If one instruction stalls, ___________________

Dynamic Scheduling

• _________ schedules ready instructions

• Only ___________ instructions stall

• _______________ resolved in hardware

4-wide Dynamic Superscalar: Fetch

Loop: lw   r2, 0(r1)
      addu r2, r2, r5
      sw   r2, 0(r1)
      addi r1, r1, -4
      bne  r1, r7, Loop

[Diagram: Instruction Window, Register File, Register Alias Table, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer. Renamed instructions in the diagram carry functional-unit tags as operands, e.g. addu r2, ldst1, r5 and bne 2add1, r7, Loop.]

Fetch 4 instructions each cycle


4-wide Dynamic Superscalar: Decode

Register Alias Table records

1. Current Register Number (WAW/WAR Register Renaming)

or

2. Functional Unit (RAW – result not ready)
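A minimal sketch of the renaming step, under simplifying assumptions: a single integer tag stands in for both "current register number" and "functional unit", fresh tags are never recycled, and the names `rat_init` and `rat_rename` are hypothetical. Sources read the current mapping, which keeps RAW dependences intact; each destination gets a fresh tag, which removes WAR and WAW.

```c
#define NUM_ARCH_REGS 32
#define NO_REG (-1)   /* marks an unused register slot */

/* Destination and two sources as architectural register numbers. */
struct instr { int dest, src1, src2; };

/* Register alias table: architectural register -> tag of its latest producer. */
struct rat {
    int map[NUM_ARCH_REGS];
    int next_tag;
};

void rat_init(struct rat *r)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        r->map[i] = i;               /* each register initially names itself */
    r->next_tag = NUM_ARCH_REGS;     /* fresh tags start above those */
}

/* Rename one instruction; must be called in program order. */
struct instr rat_rename(struct rat *r, struct instr in)
{
    struct instr out = in;
    if (in.src1 != NO_REG)
        out.src1 = r->map[in.src1];
    if (in.src2 != NO_REG)
        out.src2 = r->map[in.src2];
    if (in.dest != NO_REG) {
        out.dest = r->next_tag++;    /* fresh tag for every write */
        r->map[in.dest] = out.dest;
    }
    return out;
}
```

Renaming `lw r2, 0(r1)` followed by `addu r2, r2, r5` gives the two writes of `r2` different tags, while the `addu` source picks up the tag produced by the `lw`.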

4-wide Dynamic Superscalar: Execute

• Wait until your inputs are ready

• Execute once they are ready

4-wide Dynamic Superscalar: Memory

• First calculate the address

• Ld/St Queue checks memory addresses – out of order lw/sw

4-wide Dynamic Superscalar: Commit

[Diagram key: waiting for value / reading value]

Instructions wait until all previous instructions have completed

Fallacies & Pitfalls

• Pipelining is easy

– ______________ is difficult

• Instruction set has no impact on pipelining

– Complicated _____________ & _____________________ instructions complicate pipelining immensely

Technology Influences

• Pipelining ideas are good ideas regardless of technology

– Only recently, with extra chip space, has ___________________ become better than ____________________

– Now, pipelining limited by ________

Exceptions – Unexpected Events

• Internal

• External

Definitions

a. Anything unexpected happens

b. External event occurs

c. Internal event occurs

d. Change in control flow

             Exception      Interrupt
PowerPC
Intel
MIPS

Exception-Handling

• Stop

• Transfer control to OS

• Tell OS what happened

• Begin executing where we left off

1. Detect Exception

• Add control lines to detect errors

Step 2: Store PC into EPC

[Datapath diagram: PC (+4), Instruction Memory, Register File, Sign Ext (16 to 32), <<2, Data Memory]

Step 3: Tell OS the problem

• Store error code in the _________

• Use vectored interrupts

– Use error code to determine _________

Cause Register

• Set a flag in the cause register

• How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
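Taking the slide's one-flag-per-exception model at face value, the OS could test the bit by shifting and masking. A minimal sketch follows; the name `overflow_occurred` and the macro are hypothetical, and the bit position 5 comes from the question above.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAUSE_OVERFLOW_BIT 5u   /* bit position taken from the question above */

/* Shift bit 5 down to bit 0 and mask everything else away.
   Equivalent to testing (cause & 0x20) != 0. */
bool overflow_occurred(uint32_t cause)
{
    return (cause >> CAUSE_OVERFLOW_BIT) & 1u;
}
```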

Vectored Interrupts

• The address of trap handler is determined by cause

Exception type            Exception vector address (in hex)
Undefined Instruction     C000 0000
Arithmetic Overflow       C000 0020
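The table can be read as a lookup from exception type to handler address. In the sketch below, only the two addresses come from the table; the numeric exception codes and all names are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical numbering; the slide does not give exception codes. */
enum exc_type {
    EXC_UNDEFINED_INSTRUCTION = 0,
    EXC_ARITHMETIC_OVERFLOW   = 1,
    EXC_COUNT
};

/* Vector addresses as listed in the table. */
static const uint32_t vector_table[EXC_COUNT] = {
    [EXC_UNDEFINED_INSTRUCTION] = 0xC0000000u,
    [EXC_ARITHMETIC_OVERFLOW]   = 0xC0000020u,
};

/* The address the PC is loaded with when the exception is taken. */
uint32_t handler_address(enum exc_type type)
{
    return vector_table[type];
}
```

The two entries are 0x20 bytes apart, which leaves room for eight 4-byte instructions per handler stub.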

Cause Register – Go to OS

[Datapath diagram: PC, Instruction Memory, Register File, Sign Ext, Data Memory, extended with EPC (-4), Cause, and Handler PC]

Vectored Interrupt – Go to OS

[Datapath diagram: PC, Instruction Memory, Register File, Sign Ext, Data Memory, extended with EPC (-4), Cause, and Vector Table]

Steps for Exceptions

• Detect exception

• Place processor in state before offending instruction

• Record exception type

• Record instruction’s PC in EPC

• Transfer control to OS
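The bookkeeping part of these steps can be sketched as a small state update. This is an illustration under assumptions: the struct, the function names, and the single handler address `0x80000180` are not from the slides. Squashing the offending instruction and the ones behind it is left out.

```c
#include <stdint.h>

struct cpu {
    uint32_t pc;      /* address of the next instruction to fetch */
    uint32_t epc;     /* exception program counter */
    uint32_t cause;   /* recorded exception type */
};

#define HANDLER_PC 0x80000180u   /* assumed single OS entry point */

/* Called once an exception is detected for the instruction at fault_pc. */
void take_exception(struct cpu *c, uint32_t fault_pc, uint32_t type)
{
    c->cause = type;         /* record exception type */
    c->epc   = fault_pc;     /* record the instruction's PC in EPC */
    c->pc    = HANDLER_PC;   /* transfer control to the OS */
}

/* When the handler is done, resume at the recorded address. */
void return_from_exception(struct cpu *c)
{
    c->pc = c->epc;
}
```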

What happens if the third instruction is undefined?

Time->              1    2    3    4    5    6    7    8
add $s0, $0, $0     IF   ID   EX   MEM  WB
lw $s1, 0($t0)           IF   ID   EX   MEM  WB
undefined                     IF   ID   EX   MEM  WB
or $s3, $s4, $t3                   IF   ID   EX   MEM  WB

In what stage is it detected? In what cycle?

1. Detection

• Must associate exception with proper instruction

• What happens if multiple exceptions happen in the same cycle?

[Pipeline timing diagram: add, lw, undefined, or across cycles 1 to 8]
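One common policy for the multiple-exception question is to service the exception that belongs to the earliest instruction in program order. In an in-order five-stage pipeline, that is the instruction in the latest stage. The sketch below assumes one pending flag per stage; the names are hypothetical.

```c
enum stage { STAGE_IF, STAGE_ID, STAGE_EX, STAGE_MEM, STAGE_WB, STAGE_COUNT };

/* pending[s] is nonzero when the instruction currently in stage s raised
   an exception this cycle. Returns the stage whose exception should be
   serviced first, or -1 when nothing is pending. */
int oldest_exception(const int pending[STAGE_COUNT])
{
    /* Scan from the back of the pipeline: later stage = older instruction. */
    for (int s = STAGE_COUNT - 1; s >= 0; s--)
        if (pending[s])
            return s;
    return -1;
}
```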

2. Preserve state before instruction

What? What does that mean?!?

3. Record exception type

• Place value in cause register or

• Use vectored interrupts

– (exception routine address dependent on exception type)

[Pipelined datapath diagram: PC, Inst Mem, RegFile, ALU, DataMem; instructions in flight: Undef, add, lw, or]

4. Record PC in EPC: Machine in detection cycle

[Pipelined datapath diagram: PC, Inst Mem, RegFile, ALU, DataMem; Undef marked]

4. Record PC in EPC: Machine before transfer

Where is the proper PC? Long gone!!!

4. Record PC in EPC

• Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)

• Precise exceptions

• Imprecise exceptions

5. Transfer control to OS

• Same as before
