CPSC 614:Graduate Computer Architecture Hazards (Chap. 3)

CPSC614Lec 2.1

Prof. E. J. Kim

CPSC 614:Graduate Computer Architecture

Hazards (Chap. 3)

CPSC614Lec 2.2

Chapter 3

Instruction-Level Parallelism and

Its Dynamic Exploitation

CPSC614Lec 2.3

Instruction-Level Parallelism (ILP)

• Instructions are evaluated in parallel.• Pipelining• Two approaches to exploiting ILP

– Hardware-dependent (chapter 3)» Intel Pentium 3 & 4, Athlon, MIPS

R10000/12000, Sun UltraSPARC III, PowerPC, …

– Software-dependent (chapter 4)» IA-64, Intel Itanium, embedded processors

CPSC614Lec 2.4

Pipeline CPI =

Ideal CPI + Structural stalls

+ Data hazard stalls + Control stalls

CPSC614Lec 2.5

Techniques to Decrease Pipeline CPI

• Forwarding and Bypassing• Delayed Branches and Simple Branch Scheduling• Basic Dynamic Scheduling (Scoreboarding)• Dynamic Scheduling with Renaming• Dynamic Branch Prediction• Issuing Multiple Instructions per Cycle• Speculation• Dynamic Memory Disambiguation

CPSC614Lec 2.6

Techniques to Decrease Pipeline CPI

• Loop Unrolling• Basic Compiler Pipeline Scheduling• Compiler Dependence Analysis• Software Pipelining, Trace Scheduling• Compiler Speculation

CPSC614Lec 2.7

Data Dependences

• If two instructions are parallel, they can execute simultaneously.

• If two instructions are dependent, they must be executed in order.

• How to determine an instruction is dependent on anther instruction?

CPSC614Lec 2.8

Data Dependences• Data dependences (True data dependences)• Name dependences• Control dependences• An instruction j is dependent on instruction

i if either – i produces a result that may be used by j, or– j is data dependent on instruction k, and k is data

dependent on i.

CPSC614Lec 2.9

Loop: L.D F0, 0(R1) ;F0=array element

ADD.D F4, F0, F2 ;add scalar in F2

S.D F4, 0(R1) ;store the result

DADDUI R1, R1, #-8 ;decrement pointer 8 bytes

BNE R1, R2, LOOP ;branch R1 != R2

Floating-point data dependences

Integer data dependence

CPSC614Lec 2.10

• A dependence– indicates the possibility of a hazard,– determines the order in which results must be

calculated, and– sets an upper bound on how much parallelism

can be possibly be exploited.

CPSC614Lec 2.11

How to Overcome a Dependence

• Maintaining the dependence but avoiding a hazard

• Eliminating a dependence by transforming the code

CPSC614Lec 2.12

Name Dependence

• Occurs then two instructions use the same register or memory location (name), but there is no flow of data between the instructions associated with that name.

• When i precedes j in program order:– Antidependence: Instruction j writes a register

or memory location that instruction i reads.– Output dependence: Instructions i and j write

the same register or memory location.

CPSC614Lec 2.13

Register Renaming

• Instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. (Especially for register operands)

CPSC614Lec 2.14

Data Hazards• Goal: Preserve the program order only

where it affects the outcome of the program to maximize ILP.

• When instruction i occurs before instruction j in program order,– RAW (Read after Write): j tries to read a source before i

writes it.– WAW (Write after Write): j tries to write an operand

before it is written by i.– WAR (Write after Read): j tries to write a destination

before it is read by i.

CPSC614Lec 2.15

Control Dependence

• Caused by branch instructions• An instruction that is control

dependent on a branch cannot be moved before the branch.

• An instruction that is not control dependent on a branch cannot be moved after the branch.

CPSC614Lec 2.16

• Control dependence is not the critical property that must be preserved.

• We can violate the control dependences, if we can do so without affecting the correctness of the program. (e.g. branch prediction)

CPSC614Lec 2.17

5 Steps of MIPS Datapath

MemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MUX

MUX

Mem

ory

MUX

SignExtend

Zero?

IF/ID

ID/EX

MEM

/WB

EX/MEM

4

Adder

Next SEQ PC Next SEQ PC

RD RD RD WB

Data

Next PC

Address

RS1

RS2

Imm

MUX

Datapath

Control Path

CPSC614Lec 2.18

5 Steps of MIPS Datapath

MemoryAccess

Write

Back

InstructionFetch


ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MUX

MUX

Mem

ory

MUX

SignExtend

Zero?

IF/ID

ID/EX

MEM

/WB

EX/MEM

4

Adder


RD RD RD WB

Data

Next PC

Address

RS1

RS2

Imm

MUX

Datapath

Control Path

Inst

1

Inst

1

Inst

2

Inst

1

Inst

2

Inst

3

CPSC614Lec 2.19

Review: Visualizing PipeliningFigure 3.3, Page 133 , CA:AQA 2e

Instr.

Order

Time (clock cycles)

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5

CPSC614Lec 2.20

Limits to pipelining

• Hazards: circumstances that would cause incorrect execution if next instruction were launched

– Structural hazards: Attempting to use the same hardware to do two different things at the same time

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

CPSC614Lec 2.21

Example: One Memory Port/Structural Hazard

Figure 3.6, Page 142 , CA:AQA 2e

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg


DMem

Structural Hazard

CPSC614Lec 2.22

Resolving structural hazards

• Defn: attempt to use same hardware for two different things at the same time

• Solution 1: Wait must detect the hazard must have mechanism to stall

• Solution 2: Throw more hardware at the problem

CPSC614Lec 2.23

Detecting and Resolving Structural Hazard

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Stall

Instr 3

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg


Reg ALU

DMemIfetch Reg

Bubble Bubble Bubble BubbleBubble

CPSC614Lec 2.24

Eliminating Structural Hazards at Design Time

ALU

InstrCache

Reg File

MUX

MUX

DataCache

MUX

SignExtend

Zero?

IF/ID

ID/EX

MEM

/WB

EX/MEM

4

Adder


RD RD RD WB

Data

Next PC

Address

RS1

RS2

Imm

MUX

Datapath

Control Path

CPSC614Lec 2.25

Instr.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg ALU DMemIfetch Reg


Data HazardsFigure 3.9, page 147 , CA:AQA 2e

Time (clock cycles)

IF ID/RF EX MEM WB




CPSC614Lec 2.26

• Read After Write (RAW) InstrJ tries to read operand before InstrI writes it

• Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards

I: add r1,r2,r3J: sub r4,r1,r3

CPSC614Lec 2.27

• Write After Read (WAR) InstrJ writes operand before InstrI reads it

• Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:– All instructions take 5 stages, and– Reads are always in stage 2, and – Writes are always in stage 5

I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7

Three Generic Data Hazards

CPSC614Lec 2.28

Three Generic Data Hazards• Write After Write (WAW)

InstrJ writes operand before InstrI writes it.

• Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”.

• Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5

• Will see WAR and WAW in later more complicated pipes

I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7

CPSC614Lec 2.29

Time (clock cycles)

Forwarding to Avoid Data HazardFigure 3.10, Page 149 , CA:AQA 2e

Inst

r.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11






CPSC614Lec 2.30

HW Change for ForwardingFigure 3.20, Page 161, CA:AQA 2e

MEM

/WR

ID/EX

EX/MEM

DataMemory

ALU

mux

mux

Registers

NextPC

Immediate

mux

CPSC614Lec 2.31

Time (clock cycles)

Instr.

Order

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

Data Hazard Even with ForwardingFigure 3.12, Page 153 , CA:AQA 2e

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

CPSC614Lec 2.32

Resolving this load hazard

• Adding hardware? ... not• Detection?• Compilation techniques?

• What is the cost of load delays?

CPSC614Lec 2.33

Resolving the Load Data Hazard

Time (clock cycles)

or r8,r1,r9

Instr.

Order

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

Reg ALU

DMemIfetch Reg

RegIfetch ALU

DMem RegBubble

Ifetch ALU

DMem RegBubble Reg

Ifetch ALU

DMemBubble Reg

CPSC614Lec 2.34

Try producing fast code fora = b + c;d = e – f;

assuming a, b, c, d ,e, and f in memory. Slow code:

LW Rb,bLW Rc,cADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,fSUB Rd,Re,RfSW d,Rd

Software Scheduling to Avoid Load Hazards

Fast code:LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd

CPSC614Lec 2.35

Instruction Set Connection

• What is exposed about this organizational hazard in the instruction set?

• k cycle delay?– bad, CPI is not part of ISA

• k instruction slot delay– load should not be followed by use of the value in

the next k instructions• Nothing, but code can reduce run-time

delays• MIPS did the transformation in the

assembler

CPSC614Lec 2.36

Historical Perspective: Microprogramming

MainMemory

executionunit

controlmemory

CPU

ADDSUBAND

DATA

.

.

.

User program plus Data

this can change!

one of these ismapped into oneof these

Supported complex instructions a sequence of simple micro-inst (RTs)Pipelined micro-instruction processing, but very limited view.Could not reorganize macroinstructions to enable pipelining

CPSC614Lec 2.37

Control Hazard on Branches=> Three Stage Stall

10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8,r1,r9

36: xor r10,r1,r11

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

CPSC614Lec 2.38

Example: Branch Stall Impact

• If 30% branch, Stall 3 cycles significant• Two part solution:

– Determine branch taken or not sooner, AND– Compute taken branch address earlier

• MIPS branch tests if register = 0 or 0• MIPS Solution:

– Move Zero test to ID/RF stage– Adder to calculate new PC in ID/RF stage– 1 clock cycle penalty for branch versus 3

CPSC614Lec 2.39

Adder

IF/ID

Pipelined MIPS DatapathFigure 3.22, page 163, CA:AQA 2/e

MemoryAccess

Write

Back

InstructionFetch


ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MUX

DataM

emory

MUX

SignExtend

Zero?

MEM

/WB

EX/MEM

4

Adder

Next SEQ PC

RD RD RD WB

Data

• Data stationary control– local decode for each instruction phase / pipeline stage

Next PC

Address

RS1

RS2

ImmM

UX

ID/EX

CPSC614Lec 2.40

Four Branch Hazard Alternatives#1: Stall until branch direction is clear#2: Predict Branch Not Taken

– Execute successor instructions in sequence– “Squash” instructions in pipeline if branch actually taken– Advantage of late pipeline state update– 47% MIPS branches not taken on average– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken– 53% MIPS branches taken on average– But haven’t calculated branch target address in MIPS

» MIPS still incurs 1 cycle branch penalty» Other machines: branch target known before outcome

CPSC614Lec 2.41

Four Branch Hazard Alternatives

#4: Delayed Branch– Define branch to take place AFTER a following

instruction

branch instructionsequential successor1sequential successor2........sequential successorn

........branch target if taken

– 1 slot delay allows proper decision and branch target address in 5 stage pipeline

– MIPS uses this

Branch delay of length n

CPSC614Lec 2.42

Delayed Branch• Where to get instructions to fill branch delay slot?

– Before branch instruction– From the target address: only valuable when branch taken– From fall through: only valuable when branch not taken– Canceling branches allow more slots to be filled

• Compiler effectiveness for single branch delay slot:– Fills about 60% of branch delay slots– About 80% of instructions executed in branch delay slots useful in

computation– About 50% (60% x 80%) of slots usefully filled

• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

CPSC614Lec 2.43

Example: Evaluating Branch Alternatives

Assume: Conditional & Unconditional = 14%, 65% change PC

Scheduling BranchCPIspeedup v. scheme penaltystall

Stall pipeline 31.421.0Predict taken 11.141.26Predict not taken 11.091.29Delayed branch 0.51.071.31

Pipeline speedup = Pipeline depth1 +Branch frequencyBranch penalty

CPSC 614:Graduate Computer Architecture Hazards (Chap. 3)

Documents

Transcript of CPSC 614:Graduate Computer Architecture Hazards (Chap. 3)