Advanced Pipelining

• Optimally Scheduling Code

• Optimally Programming Code

• Scheduling for Superscalars (6.9)

• Exceptions (5.6, 6.8)

Optimally schedule code

• for(i=0;i<N;i++)
•     A[i] = A[i] + 10;

• &(A[0]) in $s1
• &(A[N]) in $s2

slt $t1, $s3, $s0

beq $t1, $0, end

loop: lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

1. Identify Dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

$t0 – lw->addi – RAW
$t0 – addi->sw – RAW
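The dependency hunt above can be sketched mechanically. This is a minimal Python model, not part of the original slides: the `(opcode, destination, sources)` tuples and the name `find_dependencies` are assumptions made for illustration. It tracks the most recent writer of each register and reports RAW, WAR and WAW pairs within one pass through the loop body.

```python
# Loop body from the slide as (opcode, destination, sources).
# sw and bne write no register, so their destination is None.
LOOP = [
    ("lw",   "$t0", ["$s1"]),
    ("addi", "$t0", ["$t0"]),
    ("sw",   None,  ["$t0", "$s1"]),
    ("addi", "$s1", ["$s1"]),
    ("slt",  "$t1", ["$s1", "$s2"]),
    ("bne",  None,  ["$t1", "$0"]),
]

def find_dependencies(instrs):
    """Return (kind, register, earlier_index, later_index) tuples."""
    deps = []
    last_writer = {}   # register -> index of its most recent writer
    readers = {}       # register -> instructions that read it since that write
    for j, (_, dst, srcs) in enumerate(instrs):
        for r in srcs:
            if r in last_writer:                 # true dependence
                deps.append(("RAW", r, last_writer[r], j))
        if dst is not None:
            for i in readers.get(dst, []):       # earlier read, later write
                deps.append(("WAR", dst, i, j))
            if dst in last_writer:               # two writes to one register
                deps.append(("WAW", dst, last_writer[dst], j))
        for r in srcs:
            readers.setdefault(r, []).append(j)
        if dst is not None:
            last_writer[dst] = j
            readers[dst] = []
    return deps

if __name__ == "__main__":
    for dep in find_dependencies(LOOP):
        print(dep)
```

Besides the two RAW dependencies on $t0 listed on the slide, the model also reports RAW dependencies on $s1 (addi->slt) and $t1 (slt->bne), plus WAR dependencies on $s1 from lw and sw to the pointer increment. Those WAR entries are the false dependencies that step 3 targets.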

2. Draw timing diagram WITH DATA FORWARDING

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

F D X M W
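One way to check a hand-drawn diagram is a small timing model. The sketch below is an assumption-laden illustration, not the course's method: it models a five-stage pipeline with full forwarding in which the only stall is the one-cycle load-use bubble, and it ignores branch resolution timing. The names `schedule` and `render` are hypothetical.

```python
LOOP = [
    ("lw",   "$t0", ["$s1"]),
    ("addi", "$t0", ["$t0"]),
    ("sw",   None,  ["$t0", "$s1"]),
    ("addi", "$s1", ["$s1"]),
    ("slt",  "$t1", ["$s1", "$s2"]),
    ("bne",  None,  ["$t1", "$0"]),
]

def schedule(instrs):
    """Return (f, d, x): the first cycle each instruction spends in F, D and X."""
    times = []
    for j, (_, _, srcs) in enumerate(instrs):
        if j == 0:
            f, d = 0, 1
        else:
            _, prev_d, prev_x = times[j - 1]
            f, d = prev_d, prev_x      # advance only when the previous one does
        x = d + 1
        if j > 0:
            prev_op, prev_dst, _ = instrs[j - 1]
            if prev_op == "lw" and prev_dst in srcs:
                x += 1                 # load-use hazard: one bubble despite forwarding
        times.append((f, d, x))
    return times

def render(instrs, times):
    """Build one text row per instruction; a repeated D (or F) marks a stall."""
    rows = []
    for (op, _, _), (f, d, x) in zip(instrs, times):
        cells = [" "] * f + ["F"] * (d - f) + ["D"] * (x - d) + ["X", "M", "W"]
        rows.append(f"{op:<5}" + " ".join(cells))
    return rows

if __name__ == "__main__":
    print("\n".join(render(LOOP, schedule(LOOP))))
```

Under these assumptions the first addi waits one extra cycle in D for the lw result, and the six instructions finish in 11 cycles instead of 10.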

3. Remove WAR/WAW dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

RAW, WAR, WAW

[Timing diagram: one F D X M W row per instruction]

Target the false dependencies

Original:
lw $t0, 0($s1)
sw $t0, 0($s1)
addi $s1, $s1, 4

Incorrect:
lw $t0, 0($s1)
addi $s1, $s1, 4
sw $t0, 0($s1)

Correct:
lw $t0, 0($s1)
addi $s1, $s1, 4
sw $t0, -4($s1)

lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

sw $t0, ____($s1)

slt $t1, $s1, $s2

bne $t1, $0, loop

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

slt $t1, $s1, $s2

sw $t0, -4($s1)

bne $t1, $0, loop

[Timing diagram: one F D X M W row per instruction]
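The reordering can be sanity-checked with a tiny interpreter. This is a sketch under stated assumptions: the tuple encoding of instructions, the `run` function, and the choice to start $s1 at byte address 0 with $s2 at the end of the array are all made up for illustration. It runs the original loop, the scheduled loop with the -4 offset, and the same schedule with the offset left at 0.

```python
def run(program, array):
    """Interpret a tiny MIPS subset over a word array; return the final array."""
    mem = {4 * i: v for i, v in enumerate(array)}          # byte address -> word
    reg = {"$0": 0, "$s1": 0, "$s2": 4 * len(array), "$t0": 0, "$t1": 0}
    pc = 0
    while pc < len(program):
        op, a, b, c = program[pc]
        pc += 1
        if op == "lw":                 # lw a, b(c)
            reg[a] = mem[reg[c] + b]
        elif op == "sw":               # sw a, b(c)
            mem[reg[c] + b] = reg[a]
        elif op == "addi":             # addi a, b, c
            reg[a] = reg[b] + c
        elif op == "slt":              # slt a, b, c
            reg[a] = 1 if reg[b] < reg[c] else 0
        elif op == "bne":              # bne a, b, target (instruction index)
            if reg[a] != reg[b]:
                pc = c
    return [mem[4 * i] for i in range(len(array))]

ORIGINAL = [
    ("lw",   "$t0", 0, "$s1"),
    ("addi", "$t0", "$t0", 10),
    ("sw",   "$t0", 0, "$s1"),
    ("addi", "$s1", "$s1", 4),
    ("slt",  "$t1", "$s1", "$s2"),
    ("bne",  "$t1", "$0", 0),
]

SCHEDULED = [
    ("lw",   "$t0", 0, "$s1"),
    ("addi", "$s1", "$s1", 4),      # moved up into the load delay
    ("addi", "$t0", "$t0", 10),
    ("slt",  "$t1", "$s1", "$s2"),
    ("sw",   "$t0", -4, "$s1"),     # offset compensates for the earlier increment
    ("bne",  "$t1", "$0", 0),
]

# Same order as SCHEDULED, but the store offset was not adjusted.
INCORRECT = SCHEDULED[:4] + [("sw", "$t0", 0, "$s1")] + SCHEDULED[5:]

if __name__ == "__main__":
    data = [1, 2, 3]
    print(run(ORIGINAL, data), run(SCHEDULED, data), run(INCORRECT, data))
```

With the -4 offset both versions add 10 to every element. Without it, each store lands one word too far and the next load picks up the wrong value.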

Software Control Hazard Removal

if ((x % 2) == 1)
    isodd = 1;

if (x == true)
    y = false;
else
    y = true;

if ((x == MON) || (x == TUE) || (x == WED)) {}

if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) {}
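A possible set of rewrites for the four fragments above, shown in Python only to make the source-level transformation concrete (Python itself does not expose hardware branches). Assumptions, none of them from the slides: `isodd` starts at 0, x is non-negative, the day constants are consecutive integers, and the "studied" test is the more predictable of the two conditions.

```python
MON, TUE, WED, THU, FRI, SAT, SUN = range(7)   # assumed consecutive encoding

def is_odd(x):
    # Replaces: if ((x % 2) == 1) isodd = 1;
    # The low bit is computed directly, so no branch is needed.
    return x & 1

def toggle(x):
    # Replaces: if (x == true) y = false; else y = true;
    return not x

def is_early_week(x):
    # Replaces three equality tests joined by || with one range test.
    return MON <= x <= WED

def either(coin_toss_is_heads, student_studied):
    # Same truth value as the original condition, but the test assumed to be
    # more predictable comes first, so short-circuiting often skips the coin toss.
    return student_studied or coin_toss_is_heads
```

The first three replace control flow with calculation. The last keeps the branch but orders the tests so the hard-to-predict one is evaluated less often.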

Increasing Branch Performance

What does it all mean?

• Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

The moral is…..

• Calculation is less expensive than …..

Superscalars - Parallelism

Ford mass produces cars. We want to “mass produce” instructions.

Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.

Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.

“Superpipelining” (deep pipelining – many stages)

• Limiting returns because….

• Register delays are __________________________ of clock

• Difficult to __________________

SuperScalars

• __________ parts of pipeline

• Multiple instructions in _______ stage at once

SuperScalars

• Which instructions can execute in parallel?

• Fetching multiple instructions per cycle

Static Scheduling – VLIW or EPIC (Itanium)

• __________ schedules the instructions

• If one instruction stalls, all following instructions stall

• Book Example: SuperScalar MIPS:
– Two instructions / cycle
– one alu/branch, one ld/st each cycle

Schedule for SS MIPS

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

PC    ALU/branch    ld/st
0
8
16
24
32

SuperScalars - Static

[Pipeline figure: Fetch, Decode, Execute, Memory, WriteBack; Read Values, Write Values]

Loop Problem

• Problem:
– Too many _______________ in loop
– Not enough ______________ to fill in holes

• Solution:
– Do ______________ at once
– More instructions
– Only one branch

Loop Unrolling: 1. Unroll Loop

Original:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Unrolled:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Loop Unrolling: 2. Rename Registers

But wait!!! How has this helped? There are tons of dependencies! Whatever are we to do? Register Renaming!!!


Loop Unrolling: 3. Reduce Instructions

Before (fill in the offsets):
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, ___($s1)
      lw   $t1, ___($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

After:
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -8
      addu $t0, $t0, $s2
      sw   $t0, 8($s1)
      lw   $t1, 4($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop
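The same three steps (unroll, rename, merge the pointer updates) can be seen at source level. The Python below is a rough sketch rather than a translation of the assembly: the function names are made up, and the unrolled version assumes an even element count.

```python
def add_scalar_rolled(a, s):
    """One element per trip: one index update and one loop test each time."""
    out = list(a)
    i = len(out)
    while i != 0:
        i -= 1
        out[i] += s
    return out

def add_scalar_unrolled2(a, s):
    """Two elements per trip: half as many index updates and loop tests."""
    out = list(a)
    assert len(out) % 2 == 0, "sketch assumes an even element count"
    i = len(out)
    while i != 0:
        i -= 2                  # merged update, like addi $s1, $s1, -8
        t0 = out[i + 1] + s     # first copy keeps its temporary ($t0)
        t1 = out[i] + s         # second copy is renamed ($t1), so the two are independent
        out[i + 1] = t0
        out[i] = t1
    return out
```

Because `t0` and `t1` no longer share a name, the two element updates have no false dependencies between them and can be interleaved freely by a scheduler.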

Loop Unrolling: 4. Schedule

Loop: lw1   $t0, 0($s1)
      addi  $s1, $s1, -8
      addu1 $t0, $t0, $s2
      sw1   $t0, 8($s1)
      lw2   $t1, 4($s1)
      addu2 $t1, $t1, $s2
      sw2   $t1, 4($s1)
      bne   $s1, $zero, Loop

ALU/branch    lw/sw
              lw1

Performance Comparison

Original:

ALU/branch              ld/st
                        lw $t0, 0($s1)
addi $s1, $s1, -4
addu $t0, $t0, $s2
bne $s1, $zero, L       sw $t0, 4($s1)

Unrolled:
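The comparison comes down to counting issue slots. In the sketch below, `ORIGINAL` is the four-cycle schedule shown above. `UNROLLED2` is one plausible schedule for the loop unrolled twice and is an assumption, not the slide's answer: it issues addi alongside lw1 (which still reads the old $s1) and keeps one cycle between each load and the addu that uses it.

```python
# Each entry is one cycle: (ALU/branch slot, ld/st slot); None is a nop.
ORIGINAL = [
    (None,   "lw"),
    ("addi", None),
    ("addu", None),
    ("bne",  "sw"),
]

# Hypothetical schedule for the twice-unrolled loop (not from the slides).
UNROLLED2 = [
    ("addi",  "lw1"),
    (None,    "lw2"),
    ("addu1", None),
    ("addu2", "sw1"),
    ("bne",   "sw2"),
]

def stats(schedule, elements):
    """Return instructions per cycle and cycles per array element."""
    cycles = len(schedule)
    instrs = sum(op is not None for pair in schedule for op in pair)
    return {"ipc": instrs / cycles, "cycles_per_element": cycles / elements}

if __name__ == "__main__":
    print(stats(ORIGINAL, 1))
    print(stats(UNROLLED2, 2))
```

Under these assumptions the original loop issues 5 instructions in 4 cycles (IPC 1.25, 4 cycles per element), while the unrolled loop issues 8 instructions in 5 cycles (IPC 1.6, 2.5 cycles per element).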

Static Scheduling Summary

• Code size ______________ (because of nops)

• It cannot resolve __________ dependencies

• If one instruction stalls, ___________________

Dynamic Scheduling

• _________ schedules ready instructions

• Only ___________ instructions stall

• _______________ resolved in hardware

4-wide Dynamic Superscalar: Fetch

[Figure: Register File, Register Alias Table, Instruction Window, four functional units (Ld/St, 1 Add, 2 Add, 3 Add), Ld/St Queue, Commit Buffer]

Loop: lw   r2, 0(r1)
      addu r2, r2, r5
      sw   r2, 0(r1)
      addi r1, r1, -4
      bne  r1, r7, Loop

Fetch 4 instructions each cycle

Instruction Window contents (pending operands shown as functional-unit tags):
lw   r2, 0(s1)
addu r2, ldst1, r5
sw   1add1, 0(s1)
addi r1, r1, -4
bne  2add1, r7, Loop

4-wide Dynamic Superscalar: Decode

Register Alias Table records
1. Current Register Number (WAW/WAR Register Renaming)
or
2. Functional Unit (RAW – result not ready)
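A Register Alias Table can be modelled as a dictionary from architectural register to the tag of the unit that will produce its newest value. The sketch below is an illustration under assumptions: the `rename` function, the tuple encoding, and the tags given to sw and bne (`ldst2`, `3add1`) are hypothetical, while `ldst1`, `1add1` and `2add1` follow the tags visible in the figure.

```python
WINDOW = [
    ("lw",   "r2", ["r1"]),
    ("addu", "r2", ["r2", "r5"]),
    ("sw",   None, ["r2", "r1"]),
    ("addi", "r1", ["r1"]),
    ("bne",  None, ["r1", "r7"]),
]
UNITS = ["ldst1", "1add1", "ldst2", "2add1", "3add1"]   # unit assigned to each instruction

def rename(instrs, units):
    """Replace pending source registers with the tag of their producer."""
    rat = {}                    # architectural register -> producing tag
    renamed = []
    for (op, dst, srcs), tag in zip(instrs, units):
        # A register with a RAT entry is not ready yet: wait on the tag instead.
        new_srcs = [rat.get(r, r) for r in srcs]
        renamed.append((op, dst, new_srcs))
        if dst is not None:
            rat[dst] = tag      # later readers see only the newest producer
    return renamed, rat

if __name__ == "__main__":
    for entry in rename(WINDOW, UNITS)[0]:
        print(entry)
```

Because every reader is bound to a specific producer tag, the second write to r2 (by addu) cannot disturb an instruction waiting on the first one (from lw), which is how renaming removes WAW and WAR hazards while keeping the RAW ordering.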

4-wide Dynamic Superscalar: Execute

Wait until your inputs are ready

Execute once they are ready

4-wide Dynamic Superscalar: Memory

First calculate the address

Ld/St Queue checks memory addresses – out of order lw/sw
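The address check can be pictured as a scan over the older stores still in the queue. This is one conservative policy sketched for illustration, not necessarily the one the figure implements. The function name `load_action` and the three outcomes are assumptions.

```python
def load_action(load_addr, earlier_stores):
    """Decide what a load may do given the older stores still in the queue.

    earlier_stores: list of (address or None, data), oldest first;
    None means the store's address has not been calculated yet.
    Returns ("wait", None), ("forward", data) or ("memory", None).
    """
    for addr, data in reversed(earlier_stores):    # youngest older store first
        if addr is None:
            return ("wait", None)      # unknown address: reordering is not safe yet
        if addr == load_addr:
            return ("forward", data)   # same word: use the store's value
    return ("memory", None)            # no conflict: the load may pass the stores
```

For example, `load_action(8, [(4, 1)])` lets the load go to memory ahead of the store, while `load_action(8, [(None, 0)])` makes it wait until the store's address is known.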

4-wide Dynamic Superscalar: Commit

[Figure key: Waiting for value, Reading value]

Instructions wait until all previous instructions have completed
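In-order commit amounts to draining a queue from the front only while the oldest entry is finished. A minimal sketch, assuming a commit buffer modelled as a `deque` of `[name, done]` entries in program order and ignoring any limit on how many instructions may commit per cycle:

```python
from collections import deque

def commit_ready(buffer):
    """Pop and return the instructions that may commit this cycle.

    buffer: deque of [name, done] entries, oldest instruction first.
    An instruction commits only if everything ahead of it already has.
    """
    committed = []
    while buffer and buffer[0][1]:
        committed.append(buffer.popleft()[0])
    return committed

if __name__ == "__main__":
    buf = deque([["lw", True], ["addu", False], ["addi", True]])
    print(commit_ready(buf))    # addi finished early but must wait behind addu
    buf[0][1] = True            # addu completes
    print(commit_ready(buf))
```

Results may be produced out of order, but they become architecturally visible in program order, which is what later makes precise exceptions possible.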

Fallacies & Pitfalls

• Pipelining is easy
– ______________ is difficult

• Instruction set has no impact on pipelining
– Complicated _____________ & _____________________ instructions complicate pipelining immensely

Technology Influences

• Pipelining ideas are good ideas regardless of technology
– Only recently, with extra chip space, has ___________________ become better than ____________________
– Now, pipelining limited by ________

Exceptions – Unexpected Events

• Internal
• External

Definitions

a. Anything unexpected happens

b. External event occurs

c. Internal event occurs

d. Change in control flow

Exception Interrupt

PowerPC

Exception-Handling

• Stop
• Transfer control to OS
• Tell OS what happened
• Begin executing where we left off

1. Detect Exception

• Add control lines to detect errors

Step 2: Store PC into EPC

[Datapath figure: Instruction Memory, Register File, Sign Ext, Data Memory]

Step 3: Tell OS the problem

• Store error code in the _________

• Use vectored interrupts

– Use error code to determine _________

Cause Register

• Set a flag in the cause register

• How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
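One way to answer the question: shift or mask out the bit of interest. The sketch assumes the cause register value is already available to the handler as an integer; the helper names are made up.

```python
OVERFLOW_BIT = 5                     # bit position given on the slide
OVERFLOW_MASK = 1 << OVERFLOW_BIT    # 0x20

def set_overflow(cause):
    """What the hardware does: set the overflow flag, leave other bits alone."""
    return cause | OVERFLOW_MASK

def overflow_occurred(cause):
    """What the OS does: test only bit 5 of the cause value."""
    return (cause & OVERFLOW_MASK) != 0
```

For example, `overflow_occurred(0b100000)` is true and `overflow_occurred(0b011111)` is false, no matter which other cause bits are set.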

Vectored Interrupts

• The address of trap handler is determined by cause

Exception type           Exception vector address (in hex)
Undefined Instruction    C0 00 00 00
Arithmetic Overflow      C0 00 00 20
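With vectored interrupts the handler address is a table lookup keyed by the exception type. A minimal sketch using the two addresses from the table; the dictionary keys, the function name, and the example faulting PC are assumptions.

```python
VECTOR_TABLE = {
    "undefined_instruction": 0xC0000000,
    "arithmetic_overflow":   0xC0000020,
}

def take_exception(kind, faulting_pc):
    """Return (epc, next_pc): save the faulting PC, then jump to the handler."""
    epc = faulting_pc
    next_pc = VECTOR_TABLE[kind]
    return epc, next_pc

if __name__ == "__main__":
    epc, pc = take_exception("arithmetic_overflow", 0x00400010)
    print(hex(epc), hex(pc))
```

The two entries are 0x20 = 32 bytes apart, which leaves room for eight 4-byte instructions at each vector before the next one begins.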

Cause Register – Go to OS

[Datapath figure: Instruction Memory, Register File, Sign Ext, Data Memory, with added labels EPC, -4, Cause, and Handler PC]

Vectored Interrupt – Go to OS

[Datapath figure: Instruction Memory, Register File, Sign Ext, Data Memory, with added labels Cause and Vector Table]

Steps for Exceptions

• Detect exception

• Place processor in state before offending instruction

• Record exception type

• Record instruction’s PC in EPC

• Transfer control to OS

What happens if the third instruction is undefined?

Time->

add $s0, $0, $0

lw $s1, 0($t0)

undefined

or $s3, $s4, $t3

1 2 3 4 5 6 7 8

In what stage is it detected? In what cycle?
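The cycle can be worked out from the stage order. The sketch assumes a five-stage F D X M W pipeline with one instruction entering per cycle and no stalls, and that an undefined opcode is recognised in decode. The function name is hypothetical.

```python
STAGES = ["F", "D", "X", "M", "W"]

def cycle_in_stage(instr_index, stage):
    """1-based cycle in which the instr_index-th instruction (1-based)
    occupies `stage`, assuming no stalls."""
    return instr_index + STAGES.index(stage)

if __name__ == "__main__":
    print(cycle_in_stage(3, "D"))   # undefined instruction reaches decode
    print(cycle_in_stage(2, "M"))   # lw reaches the memory stage
```

Under these assumptions the undefined third instruction is detected in decode in cycle 4. A memory fault in the lw (second instruction) would only show up in cycle 5, later in time even though lw is earlier in program order, which is why each exception has to be tied to its own instruction.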

1. Detection

• Must associate exception with proper instruction

• What happens if multiple exceptions happen in the same cycle?

Time->

add $s0, $0, $0

lw $s1, 0($t0)

undefined

or $s3, $s4, $t3

1 2 3 4 5 6 7 8

2. Preserve state before instruction

What? What does that mean?!?

3. Record exception type

• Place value in cause register or

• Use vectored interrupts
– (exception routine address dependent on exception type)

[Pipeline datapath figure: Inst Mem, RegFile, ALU, DataMem; instructions in flight: Undef, add, lw, or]

4. Record PC in EPC: Machine in detection cycle

[Pipeline datapath figure: Inst Mem, RegFile, ALU, DataMem]

4. Record PC in EPC: Machine before transfer

Where is the proper PC? Long gone!!!

4. Record PC in EPC

• Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)

• Precise exceptions

• Imprecise exceptions

5. Transfer control to OS

• Same as before
