Advanced Pipelining

Post on 06-Feb-2016


• Optimally Scheduling Code

• Optimally Programming Code

• Scheduling for Superscalars (6.9)

• Exceptions (5.6, 6.8)

Optimally schedule code

for (i = 0; i < N; i++)
    A[i] = A[i] + 10;

• &(A[0]) in $s1 (advanced by 4 each iteration)
• &(A[N]) in $s2 (end of the array, compared against $s1)

slt $t1, $s3, $s0

beq $t1, $0, end

loop:

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

1. Identify Dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

$t0 – lw -> addi – RAW
$t0 – addi -> sw – RAW
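The dependency hunt in step 1 can be mechanized. The sketch below is only an illustration: the helper names (`regs_of`, `dependencies`) are made up for this example, the parser handles just the handful of opcodes used in this loop, and a WAW is not reported when the later instruction also reads the register, since the RAW already orders the pair.

```python
import re

LOOP = [
    "lw   $t0, 0($s1)",
    "addi $t0, $t0, 10",
    "sw   $t0, 0($s1)",
    "addi $s1, $s1, 4",
    "slt  $t1, $s1, $s2",
    "bne  $t1, $0, loop",
]

NO_DEST = {"sw", "bne", "beq"}  # these opcodes only read registers


def regs_of(instr):
    """Return (destination registers, source registers) for one instruction."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op in NO_DEST:
        return set(), set(regs)
    return {regs[0]}, set(regs[1:])


def dependencies(program):
    """List (kind, earlier index, later index, register) tuples."""
    last_writer = {}  # register -> index of the most recent writer
    readers = {}      # register -> readers since that write
    found = []
    for j, instr in enumerate(program):
        dests, srcs = regs_of(instr)
        for r in sorted(srcs):
            if r in last_writer:
                found.append(("RAW", last_writer[r], j, r))
        for r in sorted(dests):
            if r in last_writer and r not in srcs:
                found.append(("WAW", last_writer[r], j, r))
            for k in readers.get(r, []):
                found.append(("WAR", k, j, r))
        for r in srcs:
            readers.setdefault(r, []).append(j)
        for r in dests:
            last_writer[r] = j
            readers[r] = []
    return found


for dep in dependencies(LOOP):
    print(dep)
```

Besides the two $t0 RAW pairs listed above, the sketch also reports the RAW chain addi -> slt -> bne and the WAR dependencies from lw and sw to the addi that updates $s1. Those WAR entries are the false dependencies that step 3 goes after.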

2. Draw timing diagram WITH DATA FORWARDING

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

F D X M W

3. Remove WAR/WAW dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

RAW, WAR, WAW

[Timing diagram: lw, addi, sw, addi, slt, bne each pass through stages F D X M W; the extra D and F entries mark stall cycles]

Target the false dependencies

3. Remove WAR/WAW dependencies

Original            Incorrect           Correct
lw $t0, 0($s1)      lw $t0, 0($s1)      lw $t0, 0($s1)
sw $t0, 0($s1)      addi $s1, $s1, 4    addi
addi $s1, $s1, 4    sw $t0, 0($s1)      sw

lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

sw $t0, ____($s1)

slt $t1, $s1, $s2

bne $t1, $0, loop

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

3. Remove WAR/WAW dependencies

lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

slt $t1, $s1, $s2

sw $t0, -4($s1)

bne $t1, $0, loop

[Timing diagram: lw, addi, sw, addi, slt, bne each pass through stages F D X M W]

Software Control Hazard Removal

if ((x % 2) == 1)
    isodd = 1;
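One way to remove this branch is to compute the flag instead of testing for it. The sketch below uses Python only to state the arithmetic; the function names are invented for the example, and whether a given compiler actually emits a branch for either form depends on the target.

```python
def isodd_branchy(x, isodd=0):
    # Transcription of the slide: set the flag only when x is odd.
    if (x % 2) == 1:
        isodd = 1
    return isodd


def isodd_branchless(x, isodd=0):
    # The low bit of x is 1 exactly when x is odd, so OR it into the flag.
    return isodd | (x & 1)


print(isodd_branchless(7), isodd_branchless(10))
```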

Software Control Hazard Removal

if (x == true)
    y = false;
else
    y = true;
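The if/else above only inverts a flag, so it can be replaced by a calculation. A minimal sketch, assuming x and y are stored as 0 or 1 (the function names are invented for the example):

```python
def invert_branchy(x):
    # Direct transcription of the if/else on the slide.
    if x == 1:
        y = 0
    else:
        y = 1
    return y


def invert_branchless(x):
    # Flipping the low bit gives the same result with no branch;
    # on MIPS this could be a single xori with immediate 1.
    return x ^ 1


print(invert_branchless(0), invert_branchless(1))
```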

if ((x == MON) || (x == TUE) || (x == WED)) {}

Software Control Hazard Removal

if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) {}

Increasing Branch Performance

What does it all mean?

• Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

The moral is…..

• Calculation is less expensive than …..

Superscalars - Parallelism

Ford mass produces cars. We want to “mass produce” instructions.

Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.

Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.

“Superpipelining” (deep pipelining – many stages)

• Diminishing returns because…

• Register delays are __________________________ of clock

• Difficult to __________________

SuperScalars

• __________ parts of pipeline

• Multiple instructions in _______ stage at once

SuperScalars

• Which instructions can execute in parallel?

• Fetching multiple instructions per cycle

Static Scheduling – VLIW or EPIC (Itanium)

• __________ schedules the instructions

• If one instruction stalls, all following instructions stall

• Book Example: SuperScalar MIPS:
– Two instructions / cycle
– One ALU/branch, one ld/st each cycle

Schedule for SS MIPS

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

PC    ALU/branch    ld/st
0
8
16
24
32
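To get a feel for how such a table fills in, the sketch below packs instructions, in program order, into one ALU/branch slot and one ld/st slot per cycle. It is a rough model, not the book's algorithm. The assumptions are: a loaded value can be used two cycles after the lw issues, an ALU result one cycle after, and an instruction never issues before the one ahead of it. The names `decode`, `schedule`, and `total_cycles` are invented for the example. The second listing is the reordered loop used later in these slides (addi hoisted, sw offset changed to 4).

```python
import re

MEM_OPS = {"lw", "sw"}
NO_DEST = {"sw", "bne", "beq"}  # these opcodes only read registers


def decode(instr):
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op in NO_DEST:
        return op, set(), set(regs)
    return op, {regs[0]}, set(regs[1:])


def schedule(program, load_latency=2, alu_latency=1):
    """Greedy in-order packing; returns (cycle, slot, instruction) tuples."""
    ready = {}    # register -> first cycle its new value can be read
    used = set()  # (cycle, slot) pairs already filled
    cycle = 1
    placed = []
    for instr in program:
        op, dests, srcs = decode(instr)
        slot = "ld/st" if op in MEM_OPS else "ALU/branch"
        # Wait for operands, and never issue before the previous instruction.
        cycle = max([cycle] + [ready.get(r, 1) for r in srcs])
        while (cycle, slot) in used:
            cycle += 1
        used.add((cycle, slot))
        for r in dests:
            ready[r] = cycle + (load_latency if op == "lw" else alu_latency)
        placed.append((cycle, slot, instr))
    return placed


def total_cycles(placed):
    return max(cycle for cycle, _, _ in placed)


ORIGINAL = [
    "lw $t0, 0($s1)",
    "addu $t0, $t0, $s2",
    "sw $t0, 0($s1)",
    "addi $s1, $s1, -4",
    "bne $s1, $zero, Loop",
]
REORDERED = [
    "lw $t0, 0($s1)",
    "addi $s1, $s1, -4",
    "addu $t0, $t0, $s2",
    "sw $t0, 4($s1)",
    "bne $s1, $zero, Loop",
]

for name, prog in (("original", ORIGINAL), ("reordered", REORDERED)):
    print(name, total_cycles(schedule(prog)), "cycles per iteration")
```

Under these assumptions the loop as written needs 5 cycles per iteration and the reordered loop needs 4. The greedy packing may put an independent instruction in an earlier cycle than a hand-drawn schedule would; the cycle total is the point of the comparison.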

SuperScalars - Static

[Pipeline diagram: stages Fetch, Decode (Read Values), Execute, Memory, WriteBack (Write Values), occupied by lw, addu, sw, addi, bne]

Loop Problem

• Problem:
– Too many _______________ in loop
– Not enough ______________ to fill in holes

• Solution:
– Do ______________ at once
– More instructions
– Only one branch

Loop Unrolling
1. Unroll Loop

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Loop Unrolling
2. Rename Registers

But wait!!! How has this helped? There are tons of dependencies! Whatever are we to do? Register Renaming!!!

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t1, 0($s1)
      addi $s1, $s1, -4
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop


Loop Unrolling
3. Reduce Instructions

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -8
      addu $t0, $t0, $s2
      sw   $t0, 8($s1)
      lw   $t1, 4($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, ___($s1)
      lw   $t1, ___($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop
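A quick way to check that the unrolled loop with a single addi still touches the same words is to model memory and run both versions. The sketch below does that with a dict from byte address to word. The function names and the sample memory contents are made up for the example, and it assumes the starting $s1 is a multiple of 8 (an even number of elements); otherwise the unrolled loop would step past zero.

```python
def loop_original(mem, s1, s2):
    """One element per trip: lw, addi -4, addu, sw 4($s1), bne."""
    mem = dict(mem)
    while True:
        t0 = mem[s1]        # lw   $t0, 0($s1)
        s1 -= 4             # addi $s1, $s1, -4
        t0 += s2            # addu $t0, $t0, $s2
        mem[s1 + 4] = t0    # sw   $t0, 4($s1)
        if s1 == 0:         # bne  $s1, $zero, Loop
            break
    return mem


def loop_unrolled(mem, s1, s2):
    """Two elements per trip with a single addi of -8."""
    mem = dict(mem)
    while True:
        t0 = mem[s1]        # lw   $t0, 0($s1)
        s1 -= 8             # addi $s1, $s1, -8
        t0 += s2            # addu $t0, $t0, $s2
        mem[s1 + 8] = t0    # sw   $t0, 8($s1)
        t1 = mem[s1 + 4]    # lw   $t1, 4($s1)
        t1 += s2            # addu $t1, $t1, $s2
        mem[s1 + 4] = t1    # sw   $t1, 4($s1)
        if s1 == 0:         # bne  $s1, $zero, Loop
            break
    return mem


# Eight words at byte addresses 4, 8, ..., 32 (sample data for the example).
memory = {addr: addr * 10 for addr in range(4, 36, 4)}
print(loop_original(memory, 32, 7) == loop_unrolled(memory, 32, 7))
```

The offsets 8 and 4 fall out of the model directly: once $s1 has been decremented by 8, the first element sits 8 bytes above it and the second 4 bytes above it.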

Loop Unrolling
4. Schedule

Loop: lw1   $t0, 0($s1)
      addi  $s1, $s1, -8
      addu1 $t0, $t0, $s2
      sw1   $t0, 8($s1)
      lw2   $t1, 4($s1)
      addu2 $t1, $t1, $s2
      sw2   $t1, 4($s1)
      bne   $s1, $zero, Loop

ALU/branch    lw/sw
              lw1

Performance Comparison

Original

ALU/branch            ld/st
                      lw $t0, 0($s1)
addi $s1, $s1, -4
addu $t0, $t0, $s2
bne $s1, $zero, L     sw $t0, 4($s1)

Unrolled

Static Scheduling Summary

• Code size ______________ (because of nops)

• It can not resolve __________ dependencies

• If one instruction stalls, ___________________

Dynamic Scheduling

• _________ schedules ready instructions

• Only ___________ instructions stall

• _______________ resolved in hardware

4-wide Dynamic Superscalar: Fetch

Loop: lw   r2, 0(r1)
      addu r2, r2, r5
      sw   r2, 0(r1)
      addi r1, r1, -4
      bne  r1, r7, Loop

[Diagram: 4-wide datapath (Register File, Register Alias Table, Instruction Window, functional units Ld/St, 1Add, 2Add, 3Add, Ld/St Queue, Commit Buffer) with the loop's instructions in flight]

Fetch 4 instructions each cycle

4-wide Dynamic Superscalar: Decode

[Diagram: the 4-wide datapath; instructions are shown with renamed operands: lw r2, 0(s1); addu r2, ldst1, r5; sw 1add1, 0(s1); addi r1, r1, -4; bne 2add1, r7, Loop]

Register Alias Table records
1. Current Register Number (WAW/WAR – Register Renaming)
or
2. Functional Unit (RAW – result not ready)
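The renaming idea can be sketched in a few lines. This is a simplified illustration, not the hardware on the slide: instead of functional-unit tags such as ldst1 or 1add1, every destination gets a fresh tag p0, p1, ... (a naming invented for this example), and the table simply remembers the latest tag for each architectural register.

```python
def rename(program):
    """program: list of (opcode, destination or None, source registers)."""
    rat = {}      # register alias table: architectural register -> latest tag
    counter = 0
    renamed = []
    for op, dest, srcs in program:
        # Sources read whatever the table currently points at.
        new_srcs = tuple(rat.get(r, r) for r in srcs)
        new_dest = None
        if dest is not None:
            # A fresh tag per write removes WAW and WAR conflicts.
            new_dest = f"p{counter}"
            counter += 1
            rat[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


LOOP = [
    ("lw", "r2", ("r1",)),
    ("addu", "r2", ("r2", "r5")),
    ("sw", None, ("r2", "r1")),
    ("addi", "r1", ("r1",)),
    ("bne", None, ("r1", "r7")),
]

# Two back-to-back iterations, as a 4-wide fetch would see them.
for instr in rename(LOOP * 2):
    print(instr)
```

After renaming, the second iteration's lw writes p3 rather than reusing r2, so it no longer has to wait for the first iteration's sw to read its value. Only the true (RAW) dependencies remain, visible as a source tag that matches an earlier destination tag.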

4-wide Dynamic Superscalar: Execute

[Diagram: the 4-wide datapath with the loop's instructions in flight]

• Wait until your inputs are ready
• Execute once they are ready

4-wide Dynamic Superscalar: Memory

[Diagram: the 4-wide datapath with the loop's instructions in flight]

• First calculate the address
• Ld/St Queue checks memory addresses – out of order lw/sw

4-wide Dynamic Superscalar: Commit

[Diagram: the 4-wide datapath with the loop's instructions in flight; key: waiting for value, reading value]

Instructions wait until all previous instructions have completed

Fallacies & Pitfalls

• Pipelining is easy
– ______________ is difficult

• Instruction set has no impact on pipelining
– Complicated _____________ & _____________________ instructions complicate pipelining immensely

Technology Influences

• Pipelining ideas are good ideas regardless of technology
– Only recently, with extra chip space, has ___________________ become better than ____________________
– Now, pipelining limited by ________

Exceptions – Unexpected Events

• Internal
• External

Definitions

a. Anything unexpected happens

b. External event occurs

c. Internal event occurs

d. Change in control flow

           Exception    Interrupt
PowerPC
Intel
MIPS

Exception-Handling

• Stop
• Transfer control to OS
• Tell OS what happened
• Begin executing where we left off

1. Detect Exception

• Add control lines to detect errors

Step 2: Store PC into EPC

[Datapath diagram: PC, +4 adder, Instruction Memory, Register File, sign extend (16 to 32 bits), shift-left-2 units, Data Memory]

Step 3: Tell OS the problem

• Store error code in the _________

• Use vectored interrupts

– Use error code to determine _________

Cause Register

• Set a flag in the cause register

• How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
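One way to answer the question is to mask out the bit and test it. The sketch below treats the cause register as a plain integer; the bit position 5 comes from the slide, while the function names are invented for the example.

```python
OVERFLOW_BIT = 5                   # bit position given on the slide
OVERFLOW_MASK = 1 << OVERFLOW_BIT  # 0x20


def set_overflow(cause):
    """What the hardware would do: raise the overflow flag."""
    return cause | OVERFLOW_MASK


def overflow_occurred(cause):
    """What the OS would do: isolate bit 5 and test it."""
    return bool((cause >> OVERFLOW_BIT) & 1)


cause = set_overflow(0)
print(hex(cause), overflow_occurred(cause))
```

In assembly the same test is an AND with the mask 0x20 followed by a branch on zero.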

Vectored Interrupts

• The address of trap handler is determined by cause

Exception type Exception vector address (in hex)

Undefined Instruction C0 00 00 00hex

Arithmetic Overflow C0 00 00 20hex
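With vectored interrupts the hardware needs no cause register to pick a handler, because the exception type selects the address directly. A minimal sketch of that lookup, using the two addresses from the table above (the dictionary keys and function name are invented for the example):

```python
# Addresses taken from the table above; the key names are illustrative.
VECTOR_TABLE = {
    "undefined_instruction": 0xC0000000,
    "arithmetic_overflow": 0xC0000020,
}


def handler_address(exception_type):
    """Return the address the PC is loaded with for this exception type."""
    return VECTOR_TABLE[exception_type]


gap = handler_address("arithmetic_overflow") - handler_address("undefined_instruction")
print(hex(handler_address("arithmetic_overflow")), gap, "bytes between vectors")
```

The two entries are 0x20 = 32 bytes apart, which is room for eight 4-byte instructions per handler stub.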

Cause Register – Go to OS

[Datapath diagram: the same datapath with added elements labeled EPC, -4, Cause, and Handler PC]

Vectored Interrupt – Go to OS

[Datapath diagram: the same datapath with added elements labeled EPC, -4, Cause, and Vector Table]

Steps for Exceptions

• Detect exception

• Place processor in state before offending instruction

• Record exception type

• Record instruction’s PC in EPC

• Transfer control to OS
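The steps above can be sketched as a state update on a toy processor model. Everything here is illustrative: the `Cpu` class, the cause code value, and the handler address passed in are assumptions made for the example, not values from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    pc: int = 0
    epc: int = 0
    cause: int = 0
    regs: dict = field(default_factory=dict)


def take_exception(cpu, faulting_pc, cause_code, handler_pc, regs_before):
    """Apply the steps that follow detection of an exception."""
    cpu.regs = dict(regs_before)  # state as it was before the offending instruction
    cpu.cause = cause_code        # record exception type
    cpu.epc = faulting_pc         # record the offending instruction's PC
    cpu.pc = handler_pc           # transfer control to the OS
    return cpu


# Hypothetical scenario: the instruction at 0x400008 is undefined.
cpu = Cpu(pc=0x400010, regs={"$s0": 0, "$s1": 99})
take_exception(cpu, faulting_pc=0x400008, cause_code=1,
               handler_pc=0xC0000000, regs_before={"$s0": 0})
print(hex(cpu.epc), hex(cpu.pc), cpu.regs)
```

In the example the PC has already run ahead of the faulting instruction and a later instruction has written $s1, which is why both the PC and the register state have to be restored rather than simply read.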

What happens if the third instruction is undefined?

Time->             1    2    3    4    5    6    7    8
add $s0, $0, $0    IF   ID   EX   MEM  WB
lw $s1, 0($t0)          IF   ID   EX   MEM  WB
undefined                    IF   ID   EX   MEM  WB
or $s3, $s4, $t3                  IF   ID   EX   MEM  WB

In what stage is it detected? In what cycle?
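The cycle can be worked out from the diagram by simple arithmetic. The sketch below assumes one instruction enters IF per cycle with no stalls, and that an undefined opcode is recognized when it is decoded (ID); both function names are invented for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def cycle_of(instr_index, stage):
    """Cycle (1-based) in which instruction number instr_index (1-based)
    occupies the given stage, with no stalls."""
    return instr_index + STAGES.index(stage)


def stage_in(instr_index, cycle):
    """Stage an instruction occupies in a given cycle, or None."""
    k = cycle - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


# Assumption: the undefined opcode is caught in decode.
detect_cycle = cycle_of(3, "ID")
print(detect_cycle, [stage_in(i, detect_cycle) for i in (1, 2, 3, 4)])
```

Under those assumptions the third instruction is flagged in cycle 4, while add is in MEM, lw is in EX, and or has just been fetched.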

1. Detection

• Must associate exception with proper instruction

• What happens if multiple exceptions happen in the same cycle?

Time->             1    2    3    4    5    6    7    8
add $s0, $0, $0    IF   ID   EX   MEM
lw $s1, 0($t0)          IF   ID   EX
undefined                    IF   ID
or $s3, $s4, $t3                  IF

2. Preserve state before instruction

What? What does that mean?!?

3. Record exception type

• Place value in cause register, or
• Use vectored interrupts
– (exception routine address dependent on exception type)

4. Record PC in EPC
Machine in detection cycle

[Pipelined datapath diagram: PC, Inst Mem, RegFile, ALU, DataMem; instructions in flight: or, Undef, lw, add]

4. Record PC in EPC
Machine before transfer

[Pipelined datapath diagram: PC, Inst Mem, RegFile, ALU, DataMem; only Undef is still marked]

Where is the proper PC? Long gone!!!

4. Record PC in EPC

• Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)

• Precise exceptions

• Imprecise exceptions

5. Transfer control to OS

• Same as before