Advanced Pipelining
• Optimally Scheduling Code
• Optimally Programming Code
• Scheduling for Superscalars (6.9)
• Exceptions (5.6, 6.8)
Optimally schedule code
for (i = 0; i < N; i++)
    A[i] = A[i] + 10;
• &(A[0]) in $s1
• &(A[N]) in $s2
slt $t1, $s3, $s0
beq $t1, $0, end
loop:
lw $t0, 0($s1)
addi $t0, $t0, 10
sw $t0, 0($s1)
addi $s1, $s1, 4
slt $t1, $s1, $s2
bne $t1, $0, loop
1. Identify Dependencies
lw $t0, 0($s1)
addi $t0, $t0, 10
sw $t0, 0($s1)
addi $s1, $s1, 4
slt $t1, $s1, $s2
bne $t1, $0, loop
$t0 – lw -> addi – RAW
$t0 – addi -> sw – RAW
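The dependency hunt in step 1 can be mechanized. The following is a minimal sketch, not part of the original slides: the tuple encoding of each instruction and the helper name `find_dependencies` are assumptions made for illustration. It tracks the most recent writer and the readers of each register and reports RAW, WAR, and WAW pairs within one loop iteration.

```python
# Each instruction: (text, destination register or None, source registers).
# This encoding is an assumption for the sketch, not something from the slides.
LOOP = [
    ("lw   $t0, 0($s1)",   "$t0", ["$s1"]),
    ("addi $t0, $t0, 10",  "$t0", ["$t0"]),
    ("sw   $t0, 0($s1)",   None,  ["$t0", "$s1"]),
    ("addi $s1, $s1, 4",   "$s1", ["$s1"]),
    ("slt  $t1, $s1, $s2", "$t1", ["$s1", "$s2"]),
    ("bne  $t1, $0, loop", None,  ["$t1", "$0"]),
]


def find_dependencies(code):
    """Return (kind, register, earlier_index, later_index) tuples."""
    deps = []
    last_writer = {}  # register -> index of the most recent writer
    readers = {}      # register -> indices that read it since that write
    for j, (_, dst, srcs) in enumerate(code):
        # Reads happen first: a read after an earlier write is a RAW dependency.
        for reg in dict.fromkeys(srcs):
            if reg in last_writer:
                deps.append(("RAW", reg, last_writer[reg], j))
            readers.setdefault(reg, []).append(j)
        if dst is not None:
            # A write after earlier reads is WAR; after an earlier write, WAW.
            for i in readers.get(dst, []):
                if i != j:
                    deps.append(("WAR", dst, i, j))
            if dst in last_writer:
                deps.append(("WAW", dst, last_writer[dst], j))
            last_writer[dst] = j
            readers[dst] = []
    return deps


if __name__ == "__main__":
    for kind, reg, i, j in find_dependencies(LOOP):
        print(f"{reg}: {LOOP[i][0].split()[0]} -> {LOOP[j][0].split()[0]} ({kind})")
```

Besides the two RAW dependencies on $t0 listed above, the sketch also reports the WAR dependencies on $s1 (lw and sw read it before addi writes it), which are the false dependencies targeted in step 3. It also reports a WAW on $t0 between lw and addi, which the RAW between the same pair already enforces.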
2. Draw timing diagram WITH DATA FORWARDING
lw $t0, 0($s1)
addi $t0, $t0, 10
sw $t0, 0($s1)
addi $s1, $s1, 4
slt $t1, $s1, $s2
bne $t1, $0, loop
F D X M W
3. Remove WAR/WAW dependencies
lw $t0, 0($s1)
addi $t0, $t0, 10
sw $t0, 0($s1)
addi $s1, $s1, 4
slt $t1, $s1, $s2
bne $t1, $0, loop
RAW, WAR, WAW
[Timing diagram: lw, addi, sw, addi, slt, bne each pass through F D X M W; stalled instructions repeat a D or F stage]
Target the false dependencies
3. Remove WAR/WAW dependencies
Original            Incorrect           Correct
lw $t0, 0($s1)      lw $t0, 0($s1)      lw $t0, 0($s1)
sw $t0, 0($s1)      addi $s1, $s1, 4    addi $s1, $s1, 4
addi $s1, $s1, 4    sw $t0, 0($s1)      sw $t0, -4($s1)
lw $t0, 0($s1)
addi $s1, $s1, 4
addi $t0, $t0, 10
sw $t0, ____($s1)
slt $t1, $s1, $s2
bne $t1, $0, loop
lw $t0, 0($s1)
addi $t0, $t0, 10
sw $t0, 0($s1)
addi $s1, $s1, 4
slt $t1, $s1, $s2
bne $t1, $0, loop
3. Remove WAR/WAW dependencies
lw $t0, 0($s1)
addi $s1, $s1, 4
addi $t0, $t0, 10
slt $t1, $s1, $s2
sw $t0, -4($s1)
bne $t1, $0, loop
[Timing diagram: the six instructions (lw, addi, sw, addi, slt, bne) each pass through F D X M W]
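The benefit of the reordering can be checked with a small model. This is a hedged sketch under stated assumptions: a classic 5-stage pipeline with full forwarding, where the only data stall is a one-cycle load-use bubble, and branch hazards are ignored. The instruction encoding and the names `load_use_stalls` and `total_cycles` are hypothetical.

```python
# (opcode, destination or None, sources); assumed encoding for illustration.
ORIGINAL = [
    ("lw",   "$t0", ["$s1"]),
    ("addi", "$t0", ["$t0"]),
    ("sw",   None,  ["$t0", "$s1"]),
    ("addi", "$s1", ["$s1"]),
    ("slt",  "$t1", ["$s1", "$s2"]),
    ("bne",  None,  ["$t1", "$0"]),
]

SCHEDULED = [
    ("lw",   "$t0", ["$s1"]),
    ("addi", "$s1", ["$s1"]),
    ("addi", "$t0", ["$t0"]),
    ("slt",  "$t1", ["$s1", "$s2"]),
    ("sw",   None,  ["$t0", "$s1"]),   # offset becomes -4 after the move
    ("bne",  None,  ["$t1", "$0"]),
]


def load_use_stalls(code):
    """Count one bubble for each instruction that uses a lw result
    in the very next slot (forwarding cannot cover that case)."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        op, dst, _ = prev
        if op == "lw" and dst in cur[2]:
            stalls += 1
    return stalls


def total_cycles(code, stages=5):
    """Cycles for one pass through the code: fill the pipe, then one
    instruction per cycle, plus any load-use bubbles."""
    return len(code) + (stages - 1) + load_use_stalls(code)


if __name__ == "__main__":
    print("original :", total_cycles(ORIGINAL), "cycles")
    print("scheduled:", total_cycles(SCHEDULED), "cycles")
```

Under these assumptions the original order pays one bubble between lw and addi, while the scheduled order fills that slot with the pointer increment.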
Software Control Hazard Removal
if ((x % 2) == 1)
    isodd = 1;
Software Control Hazard Removal
if (x == true)
    y = false;
else
    y = true;
if ((x == MON) || (x == TUE) || (x == WED)) {}
Software Control Hazard Removal
if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) {}
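One way to read the first three examples is that each branch can be replaced by a calculation. The sketch below shows possible branch-free equivalents and checks that they agree with the branching versions. It rests on assumptions not stated in the slides: isodd starts at 0, x is non-negative in the odd test, and MON, TUE, WED are consecutive integer codes. Python is used only to demonstrate the equivalence; the performance argument applies to the compiled machine code.

```python
MON, TUE, WED, THU, FRI, SAT, SUN = range(7)  # assumed consecutive encoding


def isodd_branchy(x):
    isodd = 0                 # assumption: isodd starts at 0
    if (x % 2) == 1:
        isodd = 1
    return isodd


def isodd_calculated(x):
    return x & 1              # low bit is the answer, no branch needed


def negate_branchy(x):
    if x == True:
        y = False
    else:
        y = True
    return y


def negate_calculated(x):
    return not x              # a single logical operation


def early_week_branchy(x):
    return x == MON or x == TUE or x == WED


def early_week_calculated(x):
    # One unsigned range check instead of three compares; the mask
    # emulates 32-bit unsigned wraparound for values below MON.
    return ((x - MON) & 0xFFFFFFFF) <= (WED - MON)


if __name__ == "__main__":
    print(all(isodd_branchy(x) == isodd_calculated(x) for x in range(100)))
    print(all(early_week_branchy(x) == early_week_calculated(x) for x in range(-3, 10)))
```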
Increasing Branch Performance
What does it all mean?
• Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!
The moral is…..
• Calculation is less expensive than …..
Superscalars - Parallelism
Ford mass produces cars. We want to “mass produce” instructions
Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.
Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.
“Superpipelining” (deep pipelining – many stages)
• Limiting returns because….
• Register delays are __________________________ of clock
• Difficult to __________________
SuperScalars
• __________ parts of pipeline
• Multiple instructions in _______ stage at once
SuperScalars
• Which instructions can execute in parallel?
• Fetching multiple instructions per cycle
Static Scheduling – VLIW or EPIC (Itanium)
• __________ schedules the instructions
• If one instruction stalls, all following instructions stall
• Book Example: SuperScalar MIPS:
• Two instructions / cycle
• one alu/branch, one ld/st each cycle
Schedule for SS MIPS
Loop: lw $t0, 0($s1)
      addu $t0, $t0, $s2
      sw $t0, 0($s1)
      addi $s1, $s1, -4
      bne $s1, $zero, Loop
PC    ALU/branch    ld/st
0
8
16
24
32
SuperScalars - Static
[Figure: two-issue pipeline with stages Fetch, Decode (Read Values), Execute, Memory, WriteBack (Write Values), with lw, addu, sw, addi, bne in flight]
Loop Problem
• Problem:
– Too many _______________ in loop
– Not enough ______________ to fill in holes
• Solution:
– Do ______________ at once
– More instructions
– Only one branch
Loop Unrolling
1. Unroll Loop
Loop: lw $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw $t0, 4($s1)
      lw $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw $t0, 4($s1)
      bne $s1, $zero, Loop

Loop: lw $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw $t0, 4($s1)
      bne $s1, $zero, Loop
Loop Unrolling
2. Rename Registers
But wait!!! How has this helped? There are tons of dependencies? Whatever are we to do? Register Renaming!!!
Loop: lw $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw $t0, 4($s1)
      lw $t1, 0($s1)
      addi $s1, $s1, -4
      addu $t1, $t1, $s2
      sw $t1, 4($s1)
      bne $s1, $zero, Loop
Loop Unrolling
3. Reduce Instructions
Loop: lw $t0, 0($s1)
      addi $s1, $s1, -8
      addu $t0, $t0, $s2
      sw $t0, 8($s1)
      lw $t1, 4($s1)
      addu $t1, $t1, $s2
      sw $t1, 4($s1)
      bne $s1, $zero, Loop

Loop: lw $t0, 0($s1)
      addi $s1, $s1, -4
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw $t0, ___($s1)
      lw $t1, ___($s1)
      addu $t1, $t1, $s2
      sw $t1, 4($s1)
      bne $s1, $zero, Loop
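The offset bookkeeping is easy to get wrong, so it helps to check that the unrolled loop touches the same memory as the original. Below is a minimal sketch that mirrors both loops statement by statement. It assumes memory is a dictionary keyed by byte address, the array occupies addresses 4 through 4n with n even, and values stay small enough that 32-bit wraparound does not matter. The function names are hypothetical.

```python
def original_loop(mem, s1, s2):
    """One element per iteration; returns the number of branches executed."""
    branches = 0
    while True:
        t0 = mem[s1]          # lw   $t0, 0($s1)
        s1 -= 4               # addi $s1, $s1, -4
        t0 += s2              # addu $t0, $t0, $s2
        mem[s1 + 4] = t0      # sw   $t0, 4($s1)
        branches += 1
        if s1 == 0:           # bne  $s1, $zero, Loop
            break
    return branches


def unrolled_loop(mem, s1, s2):
    """Two elements per iteration, single pointer update, renamed $t1."""
    branches = 0
    while True:
        t0 = mem[s1]          # lw   $t0, 0($s1)
        s1 -= 8               # addi $s1, $s1, -8
        t0 += s2              # addu $t0, $t0, $s2
        mem[s1 + 8] = t0      # sw   $t0, 8($s1)
        t1 = mem[s1 + 4]      # lw   $t1, 4($s1)
        t1 += s2              # addu $t1, $t1, $s2
        mem[s1 + 4] = t1      # sw   $t1, 4($s1)
        branches += 1
        if s1 == 0:           # bne  $s1, $zero, Loop
            break
    return branches


if __name__ == "__main__":
    n = 6
    a = {4 * i: i for i in range(1, n + 1)}
    b = dict(a)
    print(original_loop(a, 4 * n, 10), unrolled_loop(b, 4 * n, 10), a == b)
```

Both versions leave memory in the same state, but the unrolled one executes half as many branches and pointer updates.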
Loop Unrolling
4. Schedule
Loop: lw1 $t0, 0($s1)
      addi $s1, $s1, -8
      addu1 $t0, $t0, $s2
      sw1 $t0, 8($s1)
      lw2 $t1, 4($s1)
      addu2 $t1, $t1, $s2
      sw2 $t1, 4($s1)
      bne $s1, $zero, Loop
ALU/branch    lw/sw
              lw1
Performance Comparison
Original
ALU/branch            ld/st
nop                   lw $t0, 0($s1)
addi $s1, $s1, -4     nop
addu $t0, $t0, $s2    nop
bne $s1, $zero, L     sw $t0, 4($s1)
Unrolled
Static Scheduling Summary
• Code size ______________ (because of nops)
• It can not resolve __________ dependencies
• If one instruction stalls, ___________________
Dynamic Scheduling
• _________ schedules ready instructions
• Only ___________ instructions stall
• _______________ resolved in hardware
4-wide Dynamic Superscalar
Fetch
[Figure: Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, Commit Buffer]
Loop: lw r2, 0(r1)
      addu r2, r2, r5
      sw r2, 0(r1)
      addi r1, r1, -4
      bne r1, r7, Loop
Fetch 4 instructions each cycle
4-wide Dynamic Superscalar
Decode
[Figure: the 4-wide machine running the loop; operands appear renamed, e.g. addu r2,ldst1,r5; sw 1add1, 0(s1); bne 2add1,r7,Loop]
Register Alias Table records
1. Current Register Number (WAW/WAR Register Renaming)
or
4-wide Dynamic Superscalar
Decode
[Figure: the 4-wide machine running the loop; operands appear renamed, e.g. addu r2,ldst1,r5; sw 1add1, 0(s1); bne 2add1,r7,Loop]
Register Alias Table records
1. Current Register Number (WAW/WAR Register Renaming)
or
2. Functional Unit (RAW – result not ready)
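The renamed operands in the figure (ldst1, 1add1, 2add1) can be reproduced with a very small model of the Register Alias Table. This is a sketch under assumptions: the assignment of each instruction to a functional-unit tag is supplied by hand to match the figure, entries are never cleared at writeback, and the name `rename` is hypothetical.

```python
# (opcode, destination or None, sources, tag of the unit that executes it).
# The unit tags are assumed here so that the output matches the slide.
LOOP = [
    ("lw",   "r2", ["r1"],       "ldst1"),
    ("addu", "r2", ["r2", "r5"], "1add1"),
    ("sw",   None, ["r2", "r1"], "ldst1"),
    ("addi", "r1", ["r1"],       "2add1"),
    ("bne",  None, ["r1", "r7"], "3add1"),
]


def rename(instructions):
    """Replace each source register by the tag of the unit that will
    produce it, if that result is still pending in the alias table."""
    rat = {}        # architectural register -> producing unit tag
    renamed = []
    for op, dst, srcs, unit_tag in instructions:
        new_srcs = [rat.get(reg, reg) for reg in srcs]
        renamed.append((op, dst, new_srcs))
        if dst is not None:
            rat[dst] = unit_tag   # later readers wait on this unit
    return renamed, rat


if __name__ == "__main__":
    renamed, rat = rename(LOOP)
    for op, dst, srcs in renamed:
        print(op, dst, srcs)
```

A source that still names a register can be read from the register file right away. A source that names a unit tag is a RAW dependency whose value is not ready yet.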
4-wide Dynamic Superscalar
Execute
[Figure: the 4-wide machine running the loop, with renamed instructions waiting in the Instruction Window]
Wait until your inputs are ready
4-wide Dynamic Superscalar
Execute
[Figure: the 4-wide machine running the loop, with ready instructions issued to the functional units]
Execute once they are ready
4-wide Dynamic Superscalar
Memory
[Figure: the 4-wide machine running the loop, with lw and sw in the Ld/St unit]
First calculate the address
4-wide Dynamic Superscalar
Memory
[Figure: the 4-wide machine running the loop, with lw and sw in the Ld/St Queue]
Ld/St Queue checks memory addresses – out of order lw/sw
4-wide Dynamic Superscalar
Commit
[Figure: the 4-wide machine running the loop, with completed instructions in the Commit Buffer. KEY: Waiting for value, Reading value]
Instructions wait until all previous instructions have completed
Fallacies & Pitfalls
• Pipelining is easy
– ______________ is difficult
• Instruction set has no impact on pipelining
– Complicated _____________ & _____________________ instructions complicate pipelining immensely
Technology Influences
• Pipelining ideas are good ideas regardless of technology
– Only recently, with extra chip space, has ___________________ become better than ____________________
– Now, pipelining limited by ________
Exceptions – Unexpected Events
• Internal
• External
Definitions
a. Anything unexpected happens
b. External event occurs
c. Internal event occurs
d. Change in control flow
           Exception    Interrupt
PowerPC
Intel
MIPS
Exception-Handling
• Stop
• Transfer control to OS
• Tell OS what happened
• Begin executing where we left off
1. Detect Exception
• Add control lines to detect errors
Step 2: Store PC into EPC
[Figure: single-cycle datapath with PC, Instruction Memory, Register File, Sign Ext (16 to 32), and Data Memory]
Step 3: Tell OS the problem
• Store error code in the _________
• Use vectored interrupts
– Use error code to determine _________
Cause Register
• Set a flag in the cause register
• How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
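One possible answer is that the OS reads the cause register and masks out the bit of interest. The sketch below assumes the flag-per-bit layout described on this slide, with overflow at bit 5. The helper names are hypothetical, and the cause value is passed in as a plain integer rather than read from real hardware.

```python
OVERFLOW_BIT = 5                      # bit position taken from the slide's question
OVERFLOW_MASK = 1 << OVERFLOW_BIT     # 0x20


def set_overflow(cause):
    """Hardware side: set the overflow flag, leaving other flags alone."""
    return cause | OVERFLOW_MASK


def overflow_occurred(cause):
    """OS side: shift the flag down to bit 0 and test it."""
    return ((cause >> OVERFLOW_BIT) & 1) == 1


if __name__ == "__main__":
    cause = set_overflow(0b0000_0100)
    print(hex(cause), overflow_occurred(cause))
```

In assembly the same test is an AND with the mask 0x20 followed by a branch on zero.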
Vectored Interrupts
• The address of trap handler is determined by cause
Exception type           Exception vector address (in hex)
Undefined Instruction    C000 0000hex
Arithmetic Overflow      C000 0020hex
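A vectored interrupt is essentially a table lookup. The following sketch uses the two addresses from the table above; the dictionary keys and the function name `handler_address` are hypothetical labels chosen for illustration.

```python
# Addresses taken from the slide's table; the string keys are assumed names.
VECTOR_TABLE = {
    "undefined_instruction": 0xC0000000,
    "arithmetic_overflow":   0xC0000020,
}


def handler_address(exception_type):
    """Return the address the PC is loaded with for this exception type."""
    return VECTOR_TABLE[exception_type]


if __name__ == "__main__":
    for name in VECTOR_TABLE:
        print(f"{name}: {handler_address(name):08X}")
```

The two vectors are 0x20 bytes apart, which leaves room for eight 4-byte instructions per handler stub before the next vector begins.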
Cause Register – Go to OS
[Figure: single-cycle datapath with PC, Instruction Memory, Register File, Sign Ext, and Data Memory, with added labels EPC, -4, Cause, and Handler PC]
Vectored Interrupt – Go to OS
[Figure: single-cycle datapath with PC, Instruction Memory, Register File, Sign Ext, and Data Memory, with added labels EPC, -4, Cause, and Vector Table]
Steps for Exceptions
• Detect exception
• Place processor in state before offending instruction
• Record exception type
• Record instruction’s PC in EPC
• Transfer control to OS
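The five steps can be sketched as a single state update. This is an illustrative model, not the hardware described in the slides: the processor state is a plain dictionary, in-flight instructions are listed oldest first by PC, and the cause code value is made up. The handler address reuses the undefined-instruction vector from the earlier table.

```python
def take_exception(state, offending_pc, cause_code, handler_pc):
    """Apply the exception steps to a dictionary-based processor model."""
    # Place processor in state before the offending instruction: keep only
    # instructions older than it (the list is ordered oldest first).
    position = state["in_flight"].index(offending_pc)
    state["in_flight"] = state["in_flight"][:position]
    # Record exception type.
    state["cause"] = cause_code
    # Record the instruction's PC in EPC.
    state["epc"] = offending_pc
    # Transfer control to the OS.
    state["pc"] = handler_pc
    return state


if __name__ == "__main__":
    UNDEFINED = 1  # hypothetical cause code
    cpu = {"pc": 16, "in_flight": [0, 4, 8, 12], "cause": 0, "epc": 0}
    print(take_exception(cpu, 8, UNDEFINED, 0xC0000000))
```

Detection itself is not modeled here. The caller is assumed to already know which in-flight instruction faulted.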
What happens if the third instruction is undefined?
Time->               1    2    3    4    5    6    7    8
add $s0, $0, $0      IF   ID   EX   MEM  WB
lw $s1, 0($t0)            IF   ID   EX   MEM  WB
undefined                      IF   ID   EX   MEM  WB
or $s3, $s4, $t3                    IF   ID   EX   MEM  WB
In what stage is it detected? In what cycle?
1. Detection
• Must associate exception with proper instruction
• What happens if multiple exceptions happen in the same cycle?
Time->               1    2    3    4    5    6    7    8
add $s0, $0, $0      IF   ID   EX   MEM
lw $s1, 0($t0)            IF   ID   EX
undefined                      IF   ID
or $s3, $s4, $t3                    IF
2. Preserve state before instruction
What? What does that mean?!?
3. Record exception type
• Place value in cause register or
• Use vectored interrupts
– (exception routine address dependent on exception type)
4. Record PC in EPC
Machine in detection cycle
[Figure: pipelined datapath (PC, Inst Mem, RegFile, ALU, DataMem) with the undefined instruction, add, lw, and or in flight]
4. Record PC in EPC
Machine before transfer
[Figure: the same pipelined datapath with only the undefined instruction marked]
Where is the proper PC? Long gone!!!
4. Record PC in EPC
• Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)
• Precise exceptions
• Imprecise exceptions
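One common way to keep the proper PC available is to carry each instruction's PC along with it through the pipeline. The sketch below simulates that idea for the four-instruction example. It assumes a 5-stage pipeline without stalls, an undefined opcode that is detected in ID, and a hypothetical function name `run_until_exception`.

```python
PROGRAM = ["add $s0, $0, $0", "lw $s1, 0($t0)", "undefined", "or $s3, $s4, $t3"]


def run_until_exception(program, base_pc=0):
    """Advance a 5-stage pipeline one cycle at a time; every slot carries
    (pc, instruction) so the faulting PC is known when detection happens."""
    pipeline = [None] * 5            # IF, ID, EX, MEM, WB
    fetch_pc = base_pc
    cycle = 0
    while True:
        cycle += 1
        index = (fetch_pc - base_pc) // 4
        incoming = (fetch_pc, program[index]) if index < len(program) else None
        pipeline = [incoming] + pipeline[:-1]   # everyone moves one stage
        fetch_pc += 4
        in_decode = pipeline[1]
        if in_decode is not None and in_decode[1] == "undefined":
            # EPC comes from the PC carried with the instruction,
            # not from the fetch PC, which has already moved on.
            return {"cycle": cycle, "stage": "ID",
                    "epc": in_decode[0], "fetch_pc": fetch_pc}
        if all(slot is None for slot in pipeline):
            return None              # program drained without an exception


if __name__ == "__main__":
    print(run_until_exception(PROGRAM))
```

In this model the fetch PC is already two instructions past the faulting one when the exception is detected, which is why reading the PC register directly would record the wrong address.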
5. Transfer control to OS
• Same as before