CPSC 614: Graduate Computer Architecture
Hazards (Chap. 3)


Prof. E. J. Kim


Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation


Instruction-Level Parallelism (ILP)

• Instructions are evaluated in parallel.
• Pipelining
• Two approaches to exploiting ILP
  – Hardware-dependent (chapter 3)
    » Intel Pentium 3 & 4, Athlon, MIPS R10000/12000, Sun UltraSPARC III, PowerPC, …
  – Software-dependent (chapter 4)
    » IA-64, Intel Itanium, embedded processors


Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
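
The CPI breakdown above can be evaluated directly. The following is a minimal sketch, assuming each stall term is expressed as average stall cycles per instruction; the function name and the sample numbers are hypothetical and chosen only for illustration.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Ideal CPI plus the average stall cycles per instruction from each hazard class."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical per-instruction stall averages, not measurements from the lecture.
cpi = pipeline_cpi(ideal_cpi=1.0,
                   structural_stalls=0.05,
                   data_hazard_stalls=0.20,
                   control_stalls=0.15)
print(f"Pipeline CPI = {cpi:.2f}")
```

Each technique listed on the following slides attacks one of these terms, so the sum is a convenient way to see which term dominates.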


Techniques to Decrease Pipeline CPI

• Forwarding and Bypassing
• Delayed Branches and Simple Branch Scheduling
• Basic Dynamic Scheduling (Scoreboarding)
• Dynamic Scheduling with Renaming
• Dynamic Branch Prediction
• Issuing Multiple Instructions per Cycle
• Speculation
• Dynamic Memory Disambiguation


Techniques to Decrease Pipeline CPI

• Loop Unrolling
• Basic Compiler Pipeline Scheduling
• Compiler Dependence Analysis
• Software Pipelining, Trace Scheduling
• Compiler Speculation


Data Dependences

• If two instructions are parallel, they can execute simultaneously.

• If two instructions are dependent, they must be executed in order.

• How do we determine whether an instruction is dependent on another instruction?


Data Dependences
• Data dependences (true data dependences)
• Name dependences
• Control dependences
• An instruction j is data dependent on instruction i if either
  – i produces a result that may be used by j, or
  – j is data dependent on instruction k, and k is data dependent on i.


Loop: L.D    F0, 0(R1)     ;F0=array element
      ADD.D  F4, F0, F2    ;add scalar in F2
      S.D    F4, 0(R1)     ;store the result
      DADDUI R1, R1, #-8   ;decrement pointer 8 bytes
      BNE    R1, R2, Loop  ;branch R1 != R2

Floating-point data dependences: L.D to ADD.D (through F0) and ADD.D to S.D (through F4)

Integer data dependence: DADDUI to BNE (through R1)
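
The dependences in this loop can be found mechanically by remembering the last writer of each register. The sketch below is an illustration only: the tuple encoding of instructions and the function name are assumptions of this example, and it reports direct dependences within a single pass through the loop body (not dependences carried into the next iteration).

```python
# Each instruction: (mnemonic, registers written, registers read).
# This encoding is an assumption of the sketch, not an assembler format.
LOOP = [
    ("L.D",    ["F0"], ["R1"]),
    ("ADD.D",  ["F4"], ["F0", "F2"]),
    ("S.D",    [],     ["F4", "R1"]),
    ("DADDUI", ["R1"], ["R1"]),
    ("BNE",    [],     ["R1", "R2"]),
]

def true_dependences(instrs):
    """Return (producer, consumer, register) triples for one pass through the code."""
    last_writer = {}   # register -> mnemonic of the most recent instruction writing it
    deps = []
    for name, writes, reads in instrs:
        for reg in reads:
            if reg in last_writer:
                deps.append((last_writer[reg], name, reg))
        for reg in writes:
            last_writer[reg] = name
    return deps

for producer, consumer, reg in true_dependences(LOOP):
    print(f"{consumer} is data dependent on {producer} through {reg}")
```

The three pairs it reports are the two floating-point dependences and the one integer dependence marked on the slide.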


• A dependence
  – indicates the possibility of a hazard,
  – determines the order in which results must be calculated, and
  – sets an upper bound on how much parallelism can possibly be exploited.


How to Overcome a Dependence

• Maintaining the dependence but avoiding a hazard

• Eliminating a dependence by transforming the code


Name Dependence

• Occurs when two instructions use the same register or memory location (name), but there is no flow of data between the instructions associated with that name.

• When i precedes j in program order:
  – Antidependence: Instruction j writes a register or memory location that instruction i reads.
  – Output dependence: Instructions i and j write the same register or memory location.


Register Renaming

• Instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. (Especially for register operands)
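
A minimal sketch of renaming, assuming an unlimited supply of fresh register names (T0, T1, ...) and the simplest possible policy of giving every write its own name. The function name, the tuple encoding, and the T-names are hypothetical; real hardware renames onto a finite physical register file.

```python
import itertools

def rename(instrs):
    """Give every write a fresh name and point later reads at the newest name."""
    fresh = (f"T{n}" for n in itertools.count())
    current = {}        # architectural name -> most recent fresh name
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = [current.get(s, s) for s in srcs]  # reads follow the latest writer
        new_dest = next(fresh)                        # each write gets its own name
        current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("sub", "r4", ["r1", "r3"]),  # reads r1
    ("add", "r1", ["r2", "r3"]),  # writes r1: antidependence with sub
    ("mul", "r6", ["r1", "r7"]),  # reads r1: true dependence on add
]
for op, dest, srcs in rename(program):
    print(op, dest, ", ".join(srcs))
```

After renaming, add no longer writes the name that sub reads, so the two can be reordered, while mul still reads the value add produces: the name dependence is gone and the true dependence is preserved.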


Data Hazards
• Goal: Preserve the program order only where it affects the outcome of the program to maximize ILP.

• When instruction i occurs before instruction j in program order,
  – RAW (Read after Write): j tries to read a source before i writes it.
  – WAW (Write after Write): j tries to write an operand before it is written by i.
  – WAR (Write after Read): j tries to write a destination before it is read by i.
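
The three cases can be checked from the register names alone. The sketch below is illustrative: it assumes each instruction is reduced to a (destination, sources) pair, and the function name is hypothetical. It reports which hazards are possible between i and a later j, not whether a particular pipeline actually exhibits them.

```python
def hazards(i, j):
    """Classify the hazards possible between instruction i and a later instruction j.

    Each instruction is a (dest, sources) pair; dest may be None (e.g. a store).
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    found = []
    if i_dest is not None and i_dest in j_srcs:
        found.append("RAW")   # j reads what i writes
    if i_dest is not None and i_dest == j_dest:
        found.append("WAW")   # both write the same name
    if j_dest is not None and j_dest in i_srcs:
        found.append("WAR")   # j writes what i reads
    return found

print(hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))  # add then sub
print(hazards(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"])))  # sub then add
print(hazards(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"])))  # sub then add, same dest
```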


Control Dependence

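• Caused by branch instructions
• An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.

• An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch.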


• Control dependence is not the critical property that must be preserved.

• We can violate the control dependences, if we can do so without affecting the correctness of the program. (e.g. branch prediction)


5 Steps of MIPS Datapath

[Figure: pipelined MIPS datapath and control path with five stages (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back), separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]


5 Steps of MIPS Datapath

[Figure: the five-stage pipelined MIPS datapath with successive instructions (Inst 1, Inst 2, Inst 3) entering the pipeline one after another.]


Review: Visualizing Pipelining
Figure 3.3, Page 133, CA:AQA 2e

[Figure: four instructions in program order against time (clock cycles 1 to 7); each passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous instruction.]


Limits to pipelining

• Hazards: circumstances that would cause incorrect execution if next instruction were launched

– Structural hazards: Attempting to use the same hardware to do two different things at the same time

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).


Example: One Memory Port/Structural Hazard

Figure 3.6, Page 142 , CA:AQA 2e

[Figure: Load followed by Instr 1 to Instr 4, in program order against time (clock cycles 1 to 7). With one memory port, the Load's DMem access and the Ifetch of Instr 3 fall in the same cycle: structural hazard.]


Resolving structural hazards

• Defn: attempt to use same hardware for two different things at the same time

• Solution 1: Wait
  – must detect the hazard
  – must have mechanism to stall

• Solution 2: Throw more hardware at the problem


Detecting and Resolving Structural Hazard

[Figure: Load, Instr 1, Instr 2, Stall, Instr 3 in program order against time (clock cycles 1 to 7). A row of bubbles takes the slot where the conflicting instruction fetch would have been, and Instr 3 starts one cycle later.]


Eliminating Structural Hazards at Design Time

[Figure: the pipelined datapath and control path with a separate instruction cache (InstrCache) and data cache (DataCache) in place of a single memory.]


Data Hazards
Figure 3.9, page 147, CA:AQA 2e

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for the five instructions in program order; sub, and, or, and xor all read r1, which add writes.]


Three Generic Data Hazards

• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

• Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.


Three Generic Data Hazards

• Write After Read (WAR): InstrJ writes operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5


Three Generic Data Hazards

• Write After Write (WAW): InstrJ writes operand before InstrI writes it.

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5

• Will see WAR and WAW in later more complicated pipes


Forwarding to Avoid Data Hazard
Figure 3.10, Page 149, CA:AQA 2e

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) over time (clock cycles) for the five instructions in program order, with the result of add forwarded to the later instructions that read r1.]


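HW Change for Forwarding
Figure 3.20, Page 161, CA:AQA 2e

[Figure: forwarding hardware. Multiplexers (mux) at the ALU inputs select among the Registers outputs, NextPC, and Immediate held in ID/EX, and results fed back from the EX/MEM and MEM/WR pipeline registers; Data Memory follows the ALU.]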


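Data Hazard Even with Forwarding
Figure 3.12, Page 153, CA:AQA 2e

    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) over time (clock cycles) for the four instructions in program order; the value loaded by lw is not available until the end of its DMem stage, too late to forward to the ALU stage of the immediately following sub.]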


Resolving this load hazard

• Adding hardware? ... not
• Detection?
• Compilation techniques?

• What is the cost of load delays?


Resolving the Load Data Hazard

    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

[Figure: pipeline diagram over time (clock cycles) for the four instructions in program order; a bubble is inserted after lw, so sub, and, and or each slip by one cycle.]


Software Scheduling to Avoid Load Hazards

Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
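
The difference between the two schedules can be counted. This is a rough sketch under stated assumptions: a 5-stage pipeline with full forwarding, where the only remaining stall is one cycle when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding and the function name are hypothetical.

```python
def load_use_stalls(code):
    """Count one-cycle stalls: an instruction reading a register that the
    immediately preceding LW loads."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

# (opcode, destination register or None, source registers)
slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

print("slow code stalls:", load_use_stalls(slow))
print("fast code stalls:", load_use_stalls(fast))
```

Under these assumptions the slow code stalls at ADD and at SUB, each directly after the load of one of its operands, while the fast code places an independent instruction after every load.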


Instruction Set Connection

• What is exposed about this organizational hazard in the instruction set?
• k cycle delay?
  – bad, CPI is not part of ISA
• k instruction slot delay
  – load should not be followed by use of the value in the next k instructions
• Nothing, but code can reduce run-time delays
• MIPS did the transformation in the assembler


Historical Perspective: Microprogramming

[Figure: a CPU containing control memory and an execution unit, attached to Main Memory holding the user program plus data (ADD, SUB, AND, ..., DATA). Annotations: "this can change!" and "one of these is mapped into one of these".]

• Supported complex instructions as a sequence of simple micro-inst (RTs)
• Pipelined micro-instruction processing, but very limited view.
• Could not reorganize macroinstructions to enable pipelining


Control Hazard on Branches => Three Stage Stall

    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11

[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) for the five instructions above.]


Example: Branch Stall Impact

• If 30% of instructions are branches, a 3-cycle stall is significant
• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3
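
To see why the stall matters, the numbers on this slide can be put into the CPI formula. This is a back-of-the-envelope sketch that assumes an ideal CPI of 1, no other stalls, and that every branch pays the full penalty; the function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, base_cpi=1.0):
    """CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles

three_cycle = cpi_with_branch_stalls(0.30, 3)  # branch resolved late in the pipeline
one_cycle = cpi_with_branch_stalls(0.30, 1)    # zero test and target adder in ID/RF

print(f"3-cycle penalty: CPI = {three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {one_cycle:.2f}")
print(f"speedup from the MIPS solution = {three_cycle / one_cycle:.2f}")
```

Under these assumptions the 3-cycle penalty nearly doubles the CPI, which is the sense in which the slide calls it significant.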


Pipelined MIPS Datapath
Figure 3.22, page 163, CA:AQA 2/e

[Figure: pipelined MIPS datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with the Zero? test and an additional adder for the branch target placed in the Instr. Decode / Reg. Fetch stage.]

• Data stationary control
  – local decode for each instruction phase / pipeline stage


Four Branch Hazard Alternatives

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% MIPS branches not taken on average
  – PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  – 53% MIPS branches taken on average
  – But haven’t calculated branch target address in MIPS
    » MIPS still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome


Four Branch Hazard Alternatives

#4: Delayed Branch
  – Define branch to take place AFTER a following instruction

        branch instruction
        sequential successor1
        sequential successor2
        ........
        sequential successorn      (branch delay of length n)
        ........
        branch target if taken

  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  – MIPS uses this


Delayed Branch
• Where to get instructions to fill branch delay slot?
  – Before branch instruction
  – From the target address: only valuable when branch taken
  – From fall through: only valuable when branch not taken
  – Canceling branches allow more slots to be filled

• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots useful in computation
  – About 50% (60% x 80%) of slots usefully filled

• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)


Example: Evaluating Branch Alternatives

Assume: conditional and unconditional branches = 14% of instructions; 65% of them change the PC

Scheduling scheme    Branch penalty   CPI    Speedup v. stall
Stall pipeline       3                1.42   1.0
Predict taken        1                1.14   1.26
Predict not taken    1                1.09   1.29
Delayed branch       0.5              1.07   1.31

Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)
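
The CPI column can be recomputed from the stated assumptions. The sketch below assumes an ideal CPI of 1 and treats the predict-not-taken penalty as being paid only by the 65% of branches that change the PC; that reading is an assumption of this example, made because it matches the CPI column. Names are hypothetical.

```python
BRANCH_FREQ = 0.14      # conditional + unconditional branches (from the slide)
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC (from the slide)

# Average penalty cycles per branch for each scheme.
schemes = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 1.0 * TAKEN_FRACTION,  # assumed: penalty only when taken
    "Delayed branch":    0.5,
}

def cpi(avg_penalty, base_cpi=1.0):
    """CPI = base CPI + branch frequency x average branch penalty."""
    return base_cpi + BRANCH_FREQ * avg_penalty

results = {name: cpi(penalty) for name, penalty in schemes.items()}
stall_cpi = results["Stall pipeline"]
for name, value in results.items():
    print(f"{name:18s} CPI {value:.2f}  speedup v. stall {stall_cpi / value:.2f}")
```

The speedups printed here are ratios of unrounded CPIs, so they may differ slightly in the last digit from the speedup column on the slide.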