Computer Organization Lecture Set – 06 Chapter 6 Huei-Yung Lin.


H.Y. Lin, CCUEE

Roadmap for the Term: Major Topics

Overview / Abstractions and Technology
Performance
Instruction sets
Logic & arithmetic
Processor Implementation
  Single-cycle implementation
  Multicycle implementation
  Pipelined implementation
Memory systems
Input/Output


Pipelining Outline

Introduction
  Defining Pipelining
  Pipelining Instructions
  Hazards
Pipelined Processor Design
  Datapath
  Control
Advanced Pipelining
  Superscalar
  Dynamic Pipelining
  Examples


What is Pipelining?

A way of speeding up execution of instructions

Key idea: overlap execution of multiple instructions

Analogy: doing your laundry

1. Run load through washer

2. Run load through dryer

3. Fold clothes

4. Put away clothes

5. Go to 1

Observation: we can start another load as soon as we finish step 1!


The Laundry Analogy

Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

Washer takes 30 minutes

Dryer takes 30 minutes

“Folder” takes 30 minutes

“Stasher” takes 30 minutes to put clothes into drawers



If we do laundry sequentially...

[Figure: timeline from 6 PM to 2 AM, task order A, B, C, D. The loads run one after another, each taking four 30-minute steps.]

Time Required: 8 hours for 4 loads


To Pipeline, We Overlap Tasks

[Figure: timeline from 6 PM to 2 AM, task order A, B, C, D. The loads start 30 minutes apart, so their 30-minute steps overlap.]

Time required: 3.5 hours for 4 loads
Latency remains 2 hours per load
Throughput improves by a factor of 2.3 (the gain grows with more loads, approaching 4)
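The laundry numbers can be checked with a short calculation. This is a minimal Python sketch; the function names are illustrative, and it assumes all four steps take exactly 30 minutes, as on the slide.

```python
STAGE_MINUTES = 30   # washer, dryer, folder and stasher each take 30 minutes
NUM_STAGES = 4

def sequential_minutes(loads: int) -> int:
    """Each load finishes all four steps before the next load starts."""
    return loads * NUM_STAGES * STAGE_MINUTES

def pipelined_minutes(loads: int) -> int:
    """The first load needs all four steps; each later load finishes one step later."""
    return (NUM_STAGES + loads - 1) * STAGE_MINUTES

def speedup(loads: int) -> float:
    """Ratio of sequential time to pipelined time for the same number of loads."""
    return sequential_minutes(loads) / pipelined_minutes(loads)

if __name__ == "__main__":
    for n in (4, 40, 400):
        print(n, sequential_minutes(n) / 60, pipelined_minutes(n) / 60, round(speedup(n), 2))
```

For 4 loads this gives 8 hours against 3.5 hours, a ratio of about 2.3. With more loads the ratio keeps rising toward the number of stages, 4.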


Pipelining a Digital System

Key idea: break big computation up into pieces

Separate each piece with a pipeline register

[Figure: a 1 ns computation divided into five 200 ps pieces, each followed by a pipeline register.]


Pipelining a Digital System

Why do this? Because it's faster for repeated computations

Non-pipelined (one 1 ns block): 1 operation finishes every 1 ns

Pipelined (five 200 ps stages): 1 operation finishes every 200 ps


Comments about pipelining

Pipelining increases throughput, but does not reduce latency
  An answer is available every 200 ps, BUT
  a single computation still takes 1 ns

Limitations:
  Computation must be divisible into stages of equal size
  Pipeline registers add overhead
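Both limitations show up in a small timing model. The sketch below is illustrative only: `pipeline_timing` is a hypothetical helper, and the 20 ps register overhead and the uneven stage split are made-up numbers, not values from the slides.

```python
def pipeline_timing(stage_delays_ps, register_overhead_ps=0):
    """Return (cycle_time, latency) in ps for a pipeline.

    The clock must fit the slowest stage plus the pipeline-register overhead.
    One operation passes through every stage, so latency = stages * cycle time.
    """
    cycle = max(stage_delays_ps) + register_overhead_ps
    latency = cycle * len(stage_delays_ps)
    return cycle, latency

# The slide's case: 1 ns split evenly into five 200 ps stages, ideal registers.
ideal = pipeline_timing([200] * 5)

# Same 1 ns of logic, but split unevenly (assumed numbers): the slowest stage sets the clock.
uneven = pipeline_timing([300, 100, 200, 200, 200])

# Even split with an assumed 20 ps of overhead per pipeline register.
with_registers = pipeline_timing([200] * 5, register_overhead_ps=20)

if __name__ == "__main__":
    print(ideal, uneven, with_registers)
```

In this model an uneven split or any register overhead lengthens the cycle, which lowers throughput and makes the latency of a single computation worse than the original 1 ns.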


Pipelining a Processor

Recall the 5 steps in instruction execution:

1. Instruction Fetch

2. Instruction Decode and Register Read

3. Execute operation or calculate address

4. Memory access

5. Write result into register

Review: Single-Cycle Processor
  All 5 steps done in a single clock cycle
  Dedicated hardware required for each step

What happens if we break execution into multiple cycles, but keep the extra hardware?


Review - Single-Cycle Processor

[Figure: single-cycle datapath (PC, Instruction Memory, Register File, sign extend, ALU, Data Memory, adders and multiplexers) divided into five regions: IF Instruction Fetch, ID Instruction Decode, EX Execute / Address Calc., MEM Memory Access, WB Write Back.]


Pipelining - Key Idea

Question: What happens if we break execution into multiple cycles, but keep the extra hardware?

Answer: In the best case, we can start executing a new instruction on each clock cycle – this is pipelining

Pipelining stages:
  IF - Instruction Fetch
  ID - Instruction Decode
  EX - Execute / Address Calculation
  MEM - Memory Access (read / write)
  WB - Write Back (results into register file)


Basic Pipelined Processor

[Figure: the single-cycle datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the five stages.]


Single-Cycle vs. Pipelined Execution

[Figure: timelines (in ps) for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each drawn as Instruction Fetch, REG RD, ALU, MEM, REG WR.]

Non-pipelined: a new instruction starts every 800 ps

Pipelined: a new instruction starts every 200 ps; every stage gets a 200 ps slot
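The totals implied by this comparison can be worked out directly. A minimal Python sketch, assuming an 800 ps single-cycle clock and a 200 ps pipelined clock as in the figure; the function names are illustrative.

```python
SINGLE_CYCLE_PS = 800     # single-cycle clock period (one instruction per cycle)
PIPELINE_CLOCK_PS = 200   # pipelined clock period
STAGES = 5                # IF, ID, EX, MEM, WB

def single_cycle_total(n_instructions: int) -> int:
    """Instructions run back to back, one full 800 ps cycle each."""
    return n_instructions * SINGLE_CYCLE_PS

def pipelined_total(n_instructions: int) -> int:
    """The first instruction takes STAGES cycles; each later one completes a cycle after."""
    return (STAGES + n_instructions - 1) * PIPELINE_CLOCK_PS

if __name__ == "__main__":
    for n in (1, 3, 1000):
        print(n, single_cycle_total(n), pipelined_total(n))
```

For the three lw instructions this gives 2400 ps against 1400 ps. A single lw actually takes longer in the pipeline (1000 ps against 800 ps), and for long instruction streams the ratio approaches 800/200 = 4 rather than 5.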


Comments about Pipelining

The good news
  Multiple instructions are being processed at same time
  This works because stages are isolated by registers
  Best case speedup of N

The bad news
  Instructions interfere with each other – hazards
    Example: different instructions may need the same piece of hardware (e.g., memory) in same clock cycle
    Example: instruction may require a result produced by an earlier instruction that is not yet complete
  Worst case: must suspend execution – stall


Pipelined Example – Executing Multiple Instructions

Consider the following instruction sequence:

lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10


Executing Multiple Instructions

[Figures: the pipelined datapath (PC, Instruction Memory, Register File, ALU, Data Memory, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) shown one clock cycle at a time, starting with Clock Cycle 1, for the sequence lw $r0, 10($r1); sw $r3, 20($r4); add $r5, $r6, $r7; sub $r8, $r9, $r10. Instructions in the pipeline in successive snapshots: LW; LW, SW; LW, SW, ADD; LW, SW, ADD, SUB; SW, ADD, SUB; ADD, SUB; SUB.]


Alternative View – Multicycle Diagram

                    CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)    IM    REG   ALU   DM    REG
sw  $r3, 20($r4)          IM    REG   ALU   DM    REG
add $r5, $r6, $r7               IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                    IM    REG   ALU   DM    REG
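A multicycle diagram like this one can be generated mechanically, since instruction i simply starts one clock cycle after instruction i-1. The following Python sketch assumes no hazards or stalls; `multicycle_diagram` and the column width are illustrative choices.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]   # resource used in each of the five stages
COL = 6                                       # characters per clock-cycle column

def multicycle_diagram(instructions):
    """Return text rows: a header of clock cycles, then one row per instruction.

    Instruction number i (0-based) enters the pipeline in clock cycle i + 1.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    width = max(len(text) for text in instructions) + 2
    header = " " * width + "".join(f"CC{c}".ljust(COL) for c in range(1, total_cycles + 1))
    rows = [header.rstrip()]
    for i, text in enumerate(instructions):
        cells = [""] * i + STAGES            # i empty columns, then the five stages
        rows.append(text.ljust(width) + "".join(cell.ljust(COL) for cell in cells).rstrip())
    return rows

program = [
    "lw $r0, 10($r1)",
    "sw $r3, 20($r4)",
    "add $r5, $r6, $r7",
    "sub $r8, $r9, $r10",
]

if __name__ == "__main__":
    for row in multicycle_diagram(program):
        print(row)
```

Four instructions in a five-stage pipeline occupy 4 + 5 - 1 = 8 clock cycles, which matches the CC 1 to CC 8 axis of the diagram.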


Pipeline Hazards

Where one instruction cannot immediately follow another

Types of hazards
  Structural hazards – attempt to use same resource twice
  Control hazards – attempt to make decision before condition is evaluated
  Data hazards – attempt to use data before it is ready

Can always resolve hazards by waiting


Structural Hazards

Attempt to use same resource twice at same time

Example: Single memory for instructions and data
  Accessed by IF stage
  Accessed at same time by MEM stage

Solutions
  Delay second access by one clock cycle, OR
  Provide separate memories for instructions and data
    This is what the book does
    This is called a “Harvard Architecture”
    Real pipelined processors have separate caches


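Example Structural Hazard – Single Memory

[Figure: four overlapped instructions, each IF ID EX MEM WB, on a time axis. A memory conflict is marked where the MEM stage of one instruction and the IF stage of a later instruction fall in the same cycle.]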
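The conflict can be found by listing which instructions need memory in each clock cycle. This Python sketch is a simplified model, not the book's hardware: `memory_conflicts` is a hypothetical helper that assumes one instruction is fetched per cycle with no stalls.

```python
from collections import defaultdict

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(n_instructions, uses_data_memory, single_memory=True):
    """Return the clock cycles in which two instructions need the same memory.

    Instruction i (0-based) is in stage number s during clock cycle i + s + 1.
    uses_data_memory[i] says whether instruction i accesses memory in MEM.
    """
    users = defaultdict(list)   # (cycle, memory name) -> instructions using it
    for i in range(n_instructions):
        if_cycle = i + STAGES.index("IF") + 1
        users[(if_cycle, "mem" if single_memory else "imem")].append(i)
        if uses_data_memory[i]:
            mem_cycle = i + STAGES.index("MEM") + 1
            users[(mem_cycle, "mem" if single_memory else "dmem")].append(i)
    return sorted(cycle for (cycle, _), who in users.items() if len(who) > 1)

if __name__ == "__main__":
    # A load followed by three instructions that do not touch data memory.
    print(memory_conflicts(4, [True, False, False, False]))
    # Same program with separate instruction and data memories.
    print(memory_conflicts(4, [True, False, False, False], single_memory=False))
```

With a single memory, the load's MEM stage collides with the fetch of the fourth instruction in clock cycle 4. With separate instruction and data memories the list of conflicts is empty.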


Control Hazards

Attempt to make a decision before condition is evaluated

Example: beq $s0, $s1, offset

Assume we add hardware to second stage to:
  Compare fetched registers for equality
  Compute branch target

This allows branch to be taken at end of second clock cycle

But this still means the result is not ready when we want to load the next instruction!


Control Hazard Solutions

Stall – stop loading instructions until result is available

Predict – assume an outcome and continue fetching (undo if prediction is wrong)

Delayed branch – specify in architecture that following instruction is always executed


Control Hazard – Stall

[Figure: pipeline timeline for add $r4,$r5,$r6; beq $r0,$r1,tgt; sw $s4,200($t5). A STALL (a row of BUBBLEs) is inserted after the beq. Annotations: "beq writes PC here" and "new PC used here".]


Control Hazard – Correct Prediction

[Figure: pipeline timeline for add $r4,$r5,$r6, the branch, and tgt: sw $s4,200($t5). The sw at tgt is fetched assuming the branch is taken; no bubbles appear.]


Control Hazard – Incorrect Prediction

[Figure: pipeline timeline for add $r4,$r5,$r6, the branch, tgt: sw $s4,200($t5) and or $r8,$r8,$r9. The sw at tgt was fetched on an incorrect prediction; it is the "squashed" instruction, shown as IF followed by BUBBLEs (a STALL), and or $r8,$r8,$r9 follows it.]


Control Hazard – Delayed Branch

[Figure: pipeline timeline for add $r4,$r5,$r6, the branch, the branch slot instruction and $r6,$r6,$r7, and tgt: sw $s4,200($t5). The branch slot instruction always executes; the correct PC is available in time to fetch the sw at tgt.]


Summary – Control Hazard Solutions

Stall – stop fetching instr. until result is available
  Significant performance penalty
  Hardware required to stall

Predict – assume an outcome and continue fetching (undo if prediction is wrong)
  Performance penalty only when guess wrong
  Hardware required to "squash" instructions

Delayed branch – specify in architecture that following instruction is always executed
  Compiler re-orders instructions into delay slot
  Insert "NOP" (no-op) operations when can't use (~50%)
  This is how original MIPS worked
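The three options can be compared with a rough cycles-per-instruction estimate. Everything numeric in this Python sketch is an assumption chosen for illustration (the 20% branch share and the 40% misprediction rate in particular); only the one-cycle penalty and the roughly 50% unfilled delay slots come from the slides. A NOP in a delay slot is modeled simply as one wasted cycle.

```python
def effective_cpi(branch_fraction, penalty_cycles, penalized_fraction=1.0, base_cpi=1.0):
    """Average cycles per instruction when some branches cost extra cycles."""
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles

BRANCH_FRACTION = 0.20   # assumed share of instructions that are branches
PENALTY = 1              # one lost cycle: branch resolved in the second stage

# Stall: every branch pays the penalty.
stall_cpi = effective_cpi(BRANCH_FRACTION, PENALTY)

# Predict: only wrong guesses pay (40% wrong is an assumed figure).
predict_cpi = effective_cpi(BRANCH_FRACTION, PENALTY, penalized_fraction=0.4)

# Delayed branch: about half of the delay slots end up holding a NOP.
delayed_cpi = effective_cpi(BRANCH_FRACTION, PENALTY, penalized_fraction=0.5)

if __name__ == "__main__":
    print(round(stall_cpi, 2), round(predict_cpi, 2), round(delayed_cpi, 2))
```

Under these assumed numbers stalling is the most expensive option, and prediction or delayed branching recover part of the loss in proportion to how often the guess is right or the slot is filled.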


Data Hazards

Attempt to use data before it is ready

Solutions
  Stalling – wait until result is available
  Forwarding – make data available inside datapath
  Reordering instructions – use compiler to avoid hazards

Examples:

add $s0, $t0, $t1   ; $s0 = $t0+$t1
sub $t2, $s0, $t3   ; $t2 = $s0-$t3

lw  $s0, 0($t0)     ; $s0 = MEM[$t0]
sub $t2, $s0, $t3   ; $t2 = $s0-$t3
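Dependences like the ones in these examples can be found by comparing each instruction's destination register with the source registers of the instructions right behind it. The Python sketch below is a simplified checker: `parse` and `raw_hazards` are hypothetical helpers that understand only R-format, lw and sw instructions, and the two-instruction window is an assumption about when the register file is written and read.

```python
import re

def parse(instr):
    """Split a MIPS-style instruction into (opcode, destination, sources).

    Handles only R-format, lw and sw as written on these slides.
    """
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":                       # sw reads both registers and writes none
        return op, None, regs
    return op, regs[0], regs[1:]         # the first register is the destination

def raw_hazards(program, window=2):
    """List (producer index, consumer index, register) read-after-write pairs.

    window=2 assumes a register written in WB can be read in the same cycle,
    so only the next two instructions can read a stale value.
    """
    parsed = [parse(text) for text in program]
    hazards = []
    for i, (_, dest, _) in enumerate(parsed):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(parsed))):
            if dest in parsed[j][2]:
                hazards.append((i, j, dest))
            if parsed[j][1] == dest:     # register overwritten: later reads are safe
                break
    return hazards

if __name__ == "__main__":
    print(raw_hazards(["add $s0, $t0, $t1", "sub $t2, $s0, $t3"]))
    print(raw_hazards(["lw $s0, 0($t0)", "sub $t2, $s0, $t3"]))
```

Both example pairs are reported as a hazard on $s0. The checker only detects the dependence; whether it costs a stall depends on which of the three solutions is used.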


Data Hazard – Stalling

[Figure: pipeline timeline for add $s0,$t0,$t1 followed by sub $t2,$s0,$t3. $s0 is written in the WB stage of the add (marked W s0) and read in the ID stage of the sub (marked R s0). Two STALLs (rows of BUBBLEs) delay the sub until $s0 has been written.]


Data Hazards – Forwarding

Key idea: connect new value directly to next stage
  Still read s0, but ignore in favor of new result

Problem: what about load instructions?

[Figure: pipeline timeline for add $s0,$t0,$t1 followed by sub $t2,$s0,$t3. The new value of s0 is passed from the add directly to the sub, ahead of the add's WB stage (W s0), so the sub does not wait.]


Data Hazards – Forwarding

STALL still required for load – data avail. after MEM

MIPS architecture calls this delayed load; initial implementations required compiler to deal with this

[Figure: pipeline timeline for lw $s0,20($t1) followed by sub $t2,$s0,$t3. The new value of s0 is available only after the MEM stage of the lw, so one STALL (a row of BUBBLEs) is inserted before the value is forwarded to the sub.]


Data Hazards – Reordering Instructions

Assuming we have data forwarding, what are the hazards in this code?

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

Reorder instructions to remove hazard:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
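The effect of the reordering can be checked by counting load-use pairs before and after. This Python sketch assumes forwarding removes every other dependence and, following the slide, treats a store of a just-loaded register as a hazard; `load_use_stalls` is a hypothetical helper, not part of any tool.

```python
import re

def load_use_stalls(program):
    """Return (lw index, user index, register) for each load-use pair.

    A pair is a lw followed immediately by an instruction that reads the
    register being loaded; with forwarding this still costs one stall.
    """
    stalls = []
    for i in range(len(program) - 1):
        op, rest = program[i].split(None, 1)
        if op != "lw":
            continue
        loaded = re.findall(r"\$\w+", rest)[0]
        next_op, next_rest = program[i + 1].split(None, 1)
        next_regs = re.findall(r"\$\w+", next_rest)
        # sw reads all of its registers; other instructions write the first one.
        sources = next_regs if next_op == "sw" else next_regs[1:]
        if loaded in sources:
            stalls.append((i, i + 1, loaded))
    return stalls

original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

if __name__ == "__main__":
    print(load_use_stalls(original))
    print(load_use_stalls(reordered))
```

The original order has one load-use pair (the second lw and the store of $t2). Swapping the two stores puts an independent instruction between them, and the two stores write different addresses, so the result in memory is unchanged.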


Summary - Pipelining Overview

Pipelining increases throughput (but does not reduce latency)
Hazards limit performance
  Structural hazards
  Control hazards
  Data hazards


Pipelining Outline

Introduction
Pipelined Processor Design
  Datapath
  Control
  Dealing with Hazards & Forwarding
  Branch Prediction
  Exceptions
  Performance
Advanced Pipelining
  Superscalar
  Dynamic Pipelining
  Examples


Pipelining in MIPS

MIPS architecture was designed to be pipelined
  Simple instruction format (makes IF, ID easy)
    Single-word instructions
    Small number of instruction formats
    Common fields in same place (e.g., rs, rt) in different formats
  Memory operations only in lw, sw instructions (simplifies EX)
  Memory operands aligned in memory (simplifies MEM)
  Single value for writeback (limits forwarding)

Pipelining is harder in CISC architectures


Pipelined Datapath with Control Signals

[Figure: the pipelined datapath (PC, Instruction Memory, Register File, sign extend, ALU, Data Memory; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with its control signals: PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, ALUControl, Branch, MemRead, MemWrite, MemtoReg. The rs, rt, rd and immed fields are labeled; rt and rd feed the RegDst multiplexer.]


Next Step: Adding Control

Basic approach: build on single-cycle control
  Place control unit in ID stage
  Pass control signals to following stages

Later: extra features to deal with:
  Data forwarding
  Stalls
  Exceptions


Control for Pipelined Datapath

Source: Book Fig. 6.29, p 469

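[Figure: the Control unit produces three groups of signals that travel through the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. ID/EX holds EX, M and WB; EX/MEM holds M and WB; MEM/WB holds WB.]

EX: RegDst, ALUOp[1:0], ALUSrc
M: MemRead, MemWrite, Branch
WB: RegWrite, MemtoReg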


Control for Pipelined Datapath

Control lines grouped by stage:
  Execution/Address Calculation stage: RegDst, ALUOp1, ALUOp0, ALUSrc
  Memory access stage: Branch, MemRead, MemWrite
  Write-back stage: RegWrite, MemtoReg

Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X

[Figure: the Control unit and the EX, M and WB signal groups carried through the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

Source: Book Fig. 6.25, p 401
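The control-line table and the way its groups travel down the pipeline can be written out as data. This Python sketch copies the rows of the control-line table on this slide; the `decode` and `pipeline_registers` helpers and the string encoding are illustrative assumptions, not the book's implementation.

```python
EX_SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")
WB_SIGNALS = ("RegWrite", "MemtoReg")

# One character per control line, in the order EX, M, WB; "X" is a don't-care.
CONTROL = {
    "R-format": "1100" "000" "10",
    "lw":       "0001" "010" "11",
    "sw":       "X001" "001" "0X",
    "beq":      "X010" "100" "0X",
}

def decode(kind):
    """Return the control bits for an instruction class, grouped by consuming stage."""
    names = EX_SIGNALS + MEM_SIGNALS + WB_SIGNALS
    values = dict(zip(names, CONTROL[kind]))
    return {
        "EX": {n: values[n] for n in EX_SIGNALS},
        "M": {n: values[n] for n in MEM_SIGNALS},
        "WB": {n: values[n] for n in WB_SIGNALS},
    }

def pipeline_registers(kind):
    """Control groups still carried by each pipeline register after decode in ID."""
    groups = decode(kind)
    return {
        "ID/EX": groups,                                  # EX, M and WB
        "EX/MEM": {k: groups[k] for k in ("M", "WB")},    # EX bits already used
        "MEM/WB": {"WB": groups["WB"]},                   # only WB bits remain
    }

if __name__ == "__main__":
    for register, groups in pipeline_registers("lw").items():
        print(register, groups)
```

Each group is dropped once its stage has used it, so the control field stored in each pipeline register gets narrower toward the end of the pipeline.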


Datapath and Control Unit

[Figure: the pipelined datapath with the Control unit added in the ID stage. Its EX, M and W control fields are stored in the pipeline registers and carried to the stage that uses them; signals shown include PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, ALUControl, Branch and MemRead.]

Tracking Control Signals - Cycle 1

[Figures: the datapath and control unit shown one clock cycle at a time as the instruction sequence enters the pipeline. Instructions present, newest first: LW; SW, LW; ADD, SW, LW; SUB, ADD, SW, LW. A further snapshot follows the same four instructions, and the later snapshots mark control values held in the pipeline registers.]


References

Portions of these slides are derived from:
  Textbook figures © 1998 Morgan Kaufmann Publishers, all rights reserved
  Tod Amon's COD2e Slides © 1998 Morgan Kaufmann Publishers, all rights reserved
  Dave Patterson’s CS 152 Slides – Fall 1997 © UCB
  Rob Rutenbar’s 18-347 Slides – Fall 1999 CMU
  John Nestor’s ECE 313 Slides – Fall 2004 LC
  T.S. Chang’s DEE 1050 Slides – Fall 2004 NCTU
  Other sources as noted
