Prof. Hakim Weatherspoon CS 3410, Spring 2015...RISC and Pipelined Processor: Putting it all...

Prof. Hakim Weatherspoon
CS 3410, Spring 2015
Computer Science
Cornell University

See P&H Chapter: 4.6‐4.8

Prelim next week
Tuesday at 7:30. Go to location based on netid
• [a-g]* → MRS146: Morrison Hall 146
• [h-l]* → RRB125: Riley-Robb Hall 125
• [m-n]* → RRB105: Riley-Robb Hall 105
• [o-s]* → MVRG71: M Van Rensselaer Hall G71
• [t-z]* → MVRG73: M Van Rensselaer Hall G73

Prelim reviews
• TODAY, Tue, Feb 24 @ 7:30pm in Olin 255
• Sat, Feb 28 @ 7:30pm in Upson B17

Prelim conflicts
• Contact Deniz Altinbuken <deniz@cs.cornell.edu>

Prelim 1:
• Time: We will start at 7:30pm sharp, so come early
• Location: on previous slide
• Closed book
  – Cannot use electronic devices or outside material
• Practice prelims are online in CMS
• Material covered: everything up to the end of this week
  – Everything up to and including data hazards
  – Appendix B (logic, gates, FSMs, memory, ALUs)
  – Chapter 4 (pipelined [and non-pipelined] MIPS processor with hazards)
  – Chapter 2 (Numbers / Arithmetic, simple MIPS instructions)
  – Chapter 1 (Performance)
  – HW1, Lab0, Lab1, Lab2, C-Lab0, C-Lab1

RISC and Pipelined Processor: Putting it all together

Data Hazards
• Data dependencies
• Problem, detection, and solutions
  – (delaying, stalling, forwarding, bypass, etc.)
• Hazard detection unit
• Forwarding unit

Next time
• Control Hazards

What is the next instruction to execute if a branch is taken? Not taken?

Simplicity favors regularity
• 32 bit instructions

Smaller is faster
• Small register file

Make the common case fast
• Include support for constants

Good design demands good compromises
• Support for different types of interpretations/classes

All MIPS instructions are 32 bits long and use one of 3 formats

R-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | func (6 bits)

I-type:  op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

J-type:  op (6 bits) | immediate (target address) (26 bits)
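The three layouts above can be sketched as bit-field extraction. This is a minimal illustration, not part of the original slides: the function name `decode` and the dictionary keys are assumptions chosen here, and the example word is the standard MIPS encoding of `add $3, $1, $2` (op = 0, func = 0x20).

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats.

    Every field is extracted regardless of format; the opcode decides which
    fields are meaningful (hypothetical helper, for illustration only).
    """
    return {
        "op":     (word >> 26) & 0x3F,       # bits 31..26, all formats
        "rs":     (word >> 21) & 0x1F,       # bits 25..21, R- and I-type
        "rt":     (word >> 16) & 0x1F,       # bits 20..16, R- and I-type
        "rd":     (word >> 11) & 0x1F,       # bits 15..11, R-type
        "shamt":  (word >> 6) & 0x1F,        # bits 10..6,  R-type
        "func":   word & 0x3F,               # bits 5..0,   R-type
        "imm16":  word & 0xFFFF,             # bits 15..0,  I-type
        "target": word & 0x3FFFFFF,          # bits 25..0,  J-type
    }


# add $3, $1, $2 encoded as an R-type word
ADD_R3_R1_R2 = 0x00221820
fields = decode(ADD_R3_R1_R2)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], hex(fields["func"]))
```

Because every format keeps `op` in the same six bits, the decoder can extract all fields in parallel and let the opcode select which ones to use, which is one reading of "simplicity favors regularity".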

Arithmetic/Logical
• R-type: result and two source registers, shift amount
• I-type: 16-bit immediate with sign/zero extension

Memory Access
• load/store between registers and memory
• word, half-word and byte operations

Control flow
• conditional branches: pc-relative addresses
• jumps: fixed offsets, register absolute

Arithmetic/Logical
• ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
• ADDI, ADDIU, ANDI, ORI, XORI, LUI, SLL, SRL, SLLV, SRLV, SRAV, SLTI, SLTIU
• MULT, DIV, MFLO, MTLO, MFHI, MTHI

Memory Access
• LW, LH, LB, LHU, LBU, LWL, LWR
• SW, SH, SB, SWL, SWR

Control flow
• BEQ, BNE, BLEZ, BLTZ, BGEZ, BGTZ
• J, JR, JAL, JALR, BEQL, BNEL, BLEZL, BGTZL

Special
• LL, SC, SYSCALL, BREAK, SYNC, COPROC

Principle:
Throughput increased by parallel execution
Balanced pipeline very important
• Else slowest stage dominates performance

Pipelining:
• Identify pipeline stages
• Isolate stages from each other
• Resolve pipeline hazards (this and next lecture)

Five stage “RISC” load‐store architecture

1. Instruction Fetch (IF)
  – get instruction from memory, increment PC

2. Instruction Decode (ID)
  – translate opcode into control signals and read registers

3. Execute (EX)
  – perform ALU operation, compute jump/branch targets

4. Memory (MEM)
  – access memory if needed

5. Writeback (WB)
  – update register file

• Each instruction goes through the 5 stages
• Each stage takes one clock cycle
  – So slowest stage determines clock cycle time

[Figure: pipeline timing diagram over clock cycles 1-9; each instruction passes through IF, ID, EX, MEM, WB]

The pipeline achieves
A) Latency: 1, throughput: 1 instr/cycle
B) Latency: 5, throughput: 1 instr/cycle
C) Latency: 1, throughput: 1/5 instr/cycle
D) Latency: 5, throughput: 5 instr/cycle
E) None of the above

Latency: 5 cycles
Throughput: 1 instr/cycle
Concurrency: 5
CPI = 1
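The latency and throughput figures can be checked with a little arithmetic. The sketch below assumes an ideal pipeline with no stalls; the function names are illustrative and not from the slides.

```python
def pipelined_cycles(n_instructions, n_stages=5):
    """Cycles to finish n instructions on an ideal (stall-free) pipeline.

    The first instruction needs n_stages cycles (its latency); every later
    instruction completes one cycle after the one before it.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def unpipelined_cycles(n_instructions, n_stages=5):
    """Cycles if each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


for n in (1, 5, 1000):
    total = pipelined_cycles(n)
    print(n, total, unpipelined_cycles(n), total / n)  # last column: cycles per instruction
```

For 5 instructions this gives 9 cycles, which is why the timing diagrams in this lecture span clock cycles 1 to 9. As the instruction count grows, the cycles per instruction approach 1.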

• Stages must share information. How?
• Add pipeline registers (flip-flops) to pass results between different stages

[Figure: five-stage datapath (Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back) with PC, instruction memory, register file, control, extend, compute jump/branch targets, and data memory (addr, din, dout), separated by pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

And is this it? Not quite...

3 kinds of hazards
• Structural hazards
  – Multiple instructions want to use same unit
• Data hazards
  – Results of instruction needed before ready
• Control hazards
  – Don't know which side of branch to take

Will get back to this
First, how to pipeline when no hazards


Assume eight-register machine
Run the following code on a pipelined datapath

add  r3 r1 r2    ; reg 3 = reg 1 + reg 2
nand r6 r4 r5    ; reg 6 = ~(reg 4 & reg 5)
lw   r4 20(r2)   ; reg 4 = Mem[reg 2 + 20]
add  r5 r2 r5    ; reg 5 = reg 2 + reg 5
sw   r7 12(r3)   ; Mem[reg 3 + 12] = reg 7

Slides thanks to Sally McKee

[Figure sequence: pipelined datapath (register file, data memory, instruction bits 11-15, 16-20, 26-31, PC+4, target, ALU result, regA, regB, extend) stepping through the program one clock cycle at a time]

Time  IF            ID            EX            MEM           WB
1     add 3 1 2
2     nand 6 4 5    add 3 1 2
3     lw 4 20(2)    nand 6 4 5    add 3 1 2
4     add 5 2 5     lw 4 20(2)    nand 6 4 5    add 3 1 2
5     sw 7 12(3)    add 5 2 5     lw 4 20(2)    nand 6 4 5    add 3 1 2
6     nop           sw 7 12(3)    add 5 2 5     lw 4 20(2)    nand 6 4 5
7     nop           nop           sw 7 12(3)    add 5 2 5     lw 4 20(2)
8     nop           nop           nop           sw 7 12(3)    add 5 2 5
9     nop           nop           nop           nop           sw 7 12(3)

• No more instructions to fetch after time 5
• nand (18 · 7): 18 = 01 0010, 7 = 00 0111, result -3 = 11 1101
• Register file values (r2-r5) shown on the slides: 9, 12, 18, 7 at the start; 9, 45, 18, 7 from time 5; 9, 45, 99, 7 at time 7; 9, 45, 99, 16 from time 8
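The cycle-by-cycle walkthrough above follows a simple rule: with no stalls, instruction i (counting from 0) occupies stage s in cycle i + s + 1. The sketch below reproduces that occupancy for the five-instruction program; the helper name `occupancy` is an assumption made for illustration, and hazards are ignored entirely.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(program):
    """Return {cycle: {stage: instruction}} for a stall-free five-stage pipeline."""
    table = {}
    for i, instr in enumerate(program):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1          # instruction i enters IF in cycle i + 1
            table.setdefault(cycle, {})[stage] = instr
    return table


program = [
    "add r3 r1 r2",
    "nand r6 r4 r5",
    "lw r4 20(r2)",
    "add r5 r2 r5",
    "sw r7 12(r3)",
]

table = occupancy(program)
for cycle in sorted(table):
    # Stages with no instruction are shown as nop, as on the slides
    row = [table[cycle].get(stage, "nop") for stage in STAGES]
    print(cycle, row)
```

Cycle 5 is the only cycle in which all five stages hold a real instruction, which matches the concurrency of 5 quoted earlier.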

Pipelining is a powerful technique to mask latencies and increase throughput
• Logically, instructions execute one at a time
• Physically, instructions execute in parallel
  – Instruction level parallelism

Abstraction promotes decoupling
• Interface (ISA) vs. implementation (Pipeline)

See P&H Chapter: 4.7‐4.8

3 kinds of hazards
• Structural hazards
  – Multiple instructions want to use same unit
• Data hazards
  – Results of instruction needed before ready
• Control hazards
  – Don't know which side of branch to take

What about data dependencies (also known as data hazards in a pipelined processor)?
e.g. add r3, r1, r2
     sub r5, r3, r4

Need to detect and then fix such hazards

Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written
  – i.e. instruction may need values that are being computed further down the pipeline
  – in fact, this is quite common

[Figure: pipeline timing diagram, clock cycles 1-9, stages IF, ID, Ex, MEM, WB]

add r3, r1, r2
sub r5, r3, r4
lw r6, 4(r3)
or r5, r3, r5
sw r6, 12(r3)

How many data hazards due to r3 only?
A) 1
B) 2
C) 3
D) 4
E) 5

[Figure: the same five instructions (1. add r3, r1, r2; 2. sub r5, r3, r4; 3. lw r6, 4(r3); 4. or r5, r3, r5; 5. sw r6, 12(r3)) on the timing diagram over clock cycles 1-9, annotated with the old value r3 = 10 and the new value r3 = 20 written by add, and with r6 = Mem[r3 + 4] for lw]

e.g. add r3, r1, r2
     sub r5, r3, r4

How to detect?

Detecting Data Hazards

Logic in ID stage:
stall = (IF/ID.Ra != 0 &&
         (IF/ID.Ra == ID/EX.Rd ||
          IF/ID.Ra == EX/M.Rd ||
          IF/ID.Ra == M/WB.Rd))
        || (same for Rb)
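The stall condition above can be written out as a small predicate. This is a sketch under stated assumptions: the `PipelineRegs` class and its field names are invented here to mirror the slide's IF/ID, ID/EX, EX/M, and M/WB names, and an instruction that writes no register is assumed to carry Rd = 0.

```python
from dataclasses import dataclass


@dataclass
class PipelineRegs:
    """Register numbers held in the pipeline registers (hypothetical model)."""
    if_id_ra: int   # first source register of the instruction in ID
    if_id_rb: int   # second source register of the instruction in ID
    id_ex_rd: int   # destination of the instruction in EX (0 = none)
    ex_m_rd: int    # destination of the instruction in MEM (0 = none)
    m_wb_rd: int    # destination of the instruction in WB (0 = none)


def needs_stall(p):
    """True if the instruction in ID reads a register that is still in flight."""
    def hazard(src):
        # Register 0 is never a hazard: it always reads as zero.
        return src != 0 and src in (p.id_ex_rd, p.ex_m_rd, p.m_wb_rd)

    return hazard(p.if_id_ra) or hazard(p.if_id_rb)


# sub r5, r3, r5 is in ID while add r3, r1, r2 is in EX
print(needs_stall(PipelineRegs(if_id_ra=3, if_id_rb=5, id_ex_rd=3, ex_m_rd=0, m_wb_rd=0)))
```

The check against M/WB.Rd is only needed when the register file cannot be written and read in the same cycle.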

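[Figure: pipelined datapath (instruction memory, ID/EX, EX/MEM, MEM/WB, data memory din/dout) with a hazard detection unit that compares the source registers in ID against Rd in the later pipeline registers]

add r3, r1, r2
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8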

Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.

What to do if data hazard detected?
A) Wait/Stall
B) Reorder in Software (SW)
C) Forward/Bypass
D) All the above
E) None. We will use some other method

Discuss today

Options
• Nothing
  – Change the ISA to match implementation
• Stall
  – Pause current and subsequent instructions till safe
• Forward/bypass
  – Forward data value to where it is needed

How to stall an instruction in ID stage
• prevent IF/ID pipeline register update
  – stalls the ID stage instruction
• convert ID stage instr into nop for later stages
  – innocuous "bubble" passes through pipeline
• prevent PC update
  – stalls the next (IF stage) instruction

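[Figure: pipelined datapath with hazard detection unit; if a hazard is detected, the instruction passed on from ID becomes a nop by setting MemWr=0 and RegWr=0. Example sequence: add r3, r1, r2; sub r5, r3, r5; or r6, r3, r4; add r6, r3, r8]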

Clock cycle       1  2  3  4  5  6  7  8
add r3, r1, r2    IF ID Ex M  W
sub r5, r3, r5       IF ID ID ID ID Ex M
or  r6, r3, r4          IF IF IF IF ID Ex
add r6, r3, r8                      IF ID

r3 = 10 before add writes back, r3 = 20 after
3 Stalls
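The stall timing can be derived mechanically. The sketch below is a simplified model, not the lecture's hardware: it assumes no forwarding, and that an instruction may only complete ID in a cycle after its producer has left WB (matching detection logic that checks the ID/EX, EX/M, and M/WB destinations). The function name `schedule` and the tuple layout are illustrative assumptions.

```python
def schedule(program):
    """Stage timing for an in-order pipeline that resolves data hazards by stalling.

    program: list of (text, dest_register_or_None, source_registers).
    Returns one dict per instruction with the cycle numbers of each stage;
    id_first..id_last is the span of cycles spent (possibly stalled) in ID.
    """
    rows = []
    last_writer = {}                       # register -> index of most recent writer
    for i, (text, dest, sources) in enumerate(program):
        if i == 0:
            if_first, id_first = 1, 2
        else:
            prev = rows[i - 1]
            if_first = prev["id_first"]    # IF frees up when the previous instr enters ID
            id_first = prev["id_last"] + 1  # ID frees up when the previous instr leaves ID
        id_last = id_first
        for src in sources:
            if src != 0 and src in last_writer:
                producer = rows[last_writer[src]]
                # Wait in ID until the producer has finished WB
                id_last = max(id_last, producer["WB"] + 1)
        rows.append({"text": text, "if_first": if_first, "id_first": id_first,
                     "id_last": id_last, "EX": id_last + 1,
                     "MEM": id_last + 2, "WB": id_last + 3})
        if dest:                           # None or register 0: nothing to track
            last_writer[dest] = i
    return rows


program = [
    ("add r3, r1, r2", 3, (1, 2)),
    ("sub r5, r3, r5", 5, (3, 5)),
    ("or r6, r3, r4", 6, (3, 4)),
    ("add r6, r3, r8", 6, (3, 8)),
]
rows = schedule(program)
for r in rows:
    print(r["text"], "ID", r["id_first"], "-", r["id_last"], "EX", r["EX"],
          "stalls:", r["id_last"] - r["id_first"])
```

In this model sub spends three extra cycles in ID, and the instructions behind it are delayed by the same amount even though they are not themselves stalled by a hazard.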

[Figure sequence: datapath during the stall. add r3,r1,r2 advances through the later stages while sub r5,r3,r5 is held in ID and or r6,r3,r4 is held in IF (WE=0); nops with MemWr=0 and RegWr=0 follow add down the pipeline]

NOP = If(IF/ID.rA ≠ 0 &&
         (IF/ID.rA == ID/Ex.Rd ||
          IF/ID.rA == Ex/M.Rd ||
          IF/ID.rA == M/W.Rd))


Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards.

Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID registers from changing, and (3) preventing writes to memory and the register file. Bubbles in the pipeline significantly decrease performance.

What to do if data hazard detected?
A) Wait/Stall
B) Reorder in Software (SW)
C) Forward/Bypass

Forwarding bypasses some pipeline stages by passing a result directly to a dependent instruction's operand (register).

Three types of forwarding/bypass
• Forwarding from Ex/Mem registers to Ex stage (M→Ex)
• Forwarding from Mem/WB register to Ex stage (W→Ex)
• RegisterFile Bypass

[Figure: pipelined datapath (IF/ID, ID/Ex, Ex/Mem, Mem/WB) with instruction memory, data memory, a hazard detection unit, and a forwarding unit]

Ex/MEM to EX Bypass
• EX needs ALU result that is still in MEM stage
• Resolve: add a bypass from EX/MEM.D to start of EX

How to detect? Logic in Ex stage:
forward = (Ex/M.WE && Ex/M.Rd != 0 &&
           ID/Ex.Ra == Ex/M.Rd)
          || (same for Rb)

[Figure: add r3, r1, r2 followed by sub r5, r3, r1 (stages IF ID Ex M W); the result of add is forwarded from Ex/Mem to the Ex stage of sub]

Mem/WB to EX Bypass
• EX needs value being written by WB
• Resolve: add a bypass from WB final value to start of EX

How to detect? Logic in Ex stage:
forward = (M/WB.WE && M/WB.Rd != 0 &&
           ID/Ex.Ra == M/WB.Rd &&
           not (Ex/M.WE && Ex/M.Rd != 0 &&
                ID/Ex.Ra == Ex/M.Rd))
          || (same for Rb)

Check pg. 311
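The two forwarding conditions combine into a priority choice for each ALU operand: prefer the newer value in Ex/Mem, fall back to Mem/WB, and otherwise use the value read from the register file. The sketch below is an illustration under assumptions: the function name, argument names, and the string labels it returns are invented here.

```python
def forward_select(id_ex_src, ex_m_we, ex_m_rd, m_wb_we, m_wb_rd):
    """Choose where one ALU operand comes from: 'EX/MEM', 'MEM/WB', or 'REG'.

    id_ex_src is the source register (Ra or Rb) of the instruction in Ex;
    *_we are the register write enables carried in the pipeline registers.
    """
    if ex_m_we and ex_m_rd != 0 and id_ex_src == ex_m_rd:
        return "EX/MEM"     # most recent producer wins
    if m_wb_we and m_wb_rd != 0 and id_ex_src == m_wb_rd:
        return "MEM/WB"     # only reached when Ex/Mem does not match
    return "REG"            # no hazard: use the value read in ID


# sub r5, r3, r1 in Ex, add r3, r1, r2 one stage ahead in Mem
print(forward_select(3, True, 3, False, 0))
# or r6, r3, r4 in Ex, sub (Rd = 5) in Mem, add (Rd = 3) in WB
print(forward_select(3, True, 5, True, 3))
```

Testing Ex/Mem first is what the `not (...)` clause in the Mem/WB condition expresses: if two in-flight instructions write the same register, the younger result must be the one forwarded.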

[Figure: add r3, r1, r2; sub r5, r3, r1; or r6, r3, r4 (stages IF ID Ex M W); the value of r3 is forwarded from Mem/WB to the Ex stage of or]

Register File Bypass
• Reading a value that is currently being written

Detect:
((Ra == MEM/WB.Rd) or (Rb == MEM/WB.Rd)) and (WB is writing a register)

Resolve:
Add a bypass around register file (WB to ID)
Better: (Hack) just negate register file clock
  – writes happen at end of first half of each clock cycle
  – reads happen during second half of each clock cycle
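The effect of the negated register file clock can be modeled by ordering the write before the reads within one cycle. This is a behavioral sketch only; the class and method names are assumptions, and real hardware achieves the ordering with clock edges rather than statement order.

```python
class RegisterFile:
    """Register file where the WB write lands before the ID reads of the same cycle."""

    def __init__(self, n_regs=32):
        self.regs = [0] * n_regs

    def cycle(self, write_enable, rd, value, ra, rb):
        # First half of the cycle: WB writes (register 0 stays zero)
        if write_enable and rd != 0:
            self.regs[rd] = value
        # Second half of the cycle: ID reads see the value just written
        return self.regs[ra], self.regs[rb]


rf = RegisterFile()
rf.regs[3] = 10
# add r3 is in WB writing 20 while a later instruction reads r3 in ID
print(rf.cycle(True, 3, 20, ra=3, rb=8))
```

With this ordering an instruction three slots behind its producer needs neither a stall nor a forwarding path.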

[Figure: add r3, r1, r2; sub r5, r3, r1; or r6, r3, r4; add r6, r3, r8 (stages IF ID Ex M W); the last add reads r3 in ID during the cycle in which the first add writes it in WB]

Clock cycle       1  2  3  4  5  6  7  8
add r3, r1, r2    IF ID Ex M  W
sub r5, r3, r5       IF ID Ex M  W
or  r6, r3, r4          IF ID Ex M  W
add r6, r3, r8             IF ID Ex M  W

r3 = 10 before add writes back, r3 = 20 after

[Figure: timing diagram over clock cycles 1-8 for the sequence below, each instruction entering IF one cycle after the previous one]

add r3, r1, r2
sub r5, r3, r4
lw r6, 4(r3)
or r5, r3, r5
sw r6, 12(r3)


Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID registers from changing, and (3) preventing writes to memory and the register file. Bubbles (nops) in the pipeline significantly decrease performance.

Forwarding bypasses some pipeline stages by passing a result directly to a dependent instruction's operand (register). It gives better performance than stalling.

Stall
• Pause current and all subsequent instructions

Forward/Bypass
• Try to steal correct value from elsewhere in pipeline
• Otherwise, fall back to stalling or require a delay slot

Tradeoffs?
