Page 1

CMSC 22200 Computer Architecture

Lecture 5: Pipelining: Data Dependency Handling

Prof. Yanjing Li University of Chicago

Page 2

Administrative Stuff

- Lab 1
  - Aim to finish grading by Thursday
- Lab 2: due next Thursday (10/20)
  - You should have started by now
  - Fix Lab 1 first! Use the "golden" simulator
- And get help!
- Lab 2 and Lab 3 late penalty waived
  - Still need to hand in within 48 hours past the due date
- Need to reschedule lecture for next Tuesday (10/18)

Page 3

Lecture Outline

- Pipelining discussions
- Non-ideal pipeline
- Dependencies and how to handle them

Page 4

Single Cycle uarch: Datapath & Control

[Figure: single-cycle datapath and control. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 5

Putting It All Together: R-Type ALU

[Figure: single-cycle datapath executing an R-type ALU instruction, with control-signal values (e.g., ALUop = 10) annotated. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 6

Putting It All Together: LDUR

[Figure: single-cycle datapath executing LDUR; the ALU performs an ADD for address generation, with control-signal values annotated. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 7

Putting It All Together: STUR

[Figure: single-cycle datapath executing STUR; the ALU performs an ADD for address generation, with control-signal values annotated. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 8

Putting It All Together: B

[Figure: single-cycle datapath executing B; the ALU operation is a don't-care (X), with control-signal values annotated. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 9

Putting It All Together: CBZ (Taken)

[Figure: single-cycle datapath executing CBZ (taken); the ALU performs a compare, with control-signal values annotated. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 10

Putting It All Together: CBZ (Not Taken)

[Figure: single-cycle datapath executing CBZ (not taken); the ALU performs a compare, with control-signal values annotated. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 11

Single Cycle uArch: Summary

- Inefficient
  - All instructions run as slow as the slowest instruction
- Not necessarily the simplest way to implement an ISA
  - Single-cycle implementation of REP MOVS (x86)?
- Not easy to optimize/improve performance
  - Optimizing the common case (e.g., common instructions) does not work
  - Need to optimize the worst case all the time
- All resources are not fully utilized
  - e.g., data memory access can't overlap with ALU operation
- How to do better?

Page 12

Single-Cycle, Multi-Cycle, Pipelining

- Single-cycle: 1 cycle per instruction, long cycle time
- Multi-cycle: 5 cycles per instruction, short cycle time
- Pipelined: 1 cycle per instruction (steady state), short cycle time

[Figure: timing over time. Single-cycle: each instruction occupies one long F-D-E-M-W cycle. Multi-cycle: F, D, E, M, W occur in five successive short cycles per instruction. Pipelined: the F D E M W stages of successive instructions overlap, completing one instruction per cycle in steady state.]

Page 13

Adding Pipeline Registers

- Registers between stages to hold information produced in the previous cycle

[Figure: pipelined datapath with pipeline registers IR_D and PC_D (decode); PC_E, A_E, B_E, and Imm_E (execute); nPC_M, Aout_M, and B_M (memory); MDR_W and Aout_W (writeback). Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Corrected from last lecture: latching PC instead of PC+4

Page 14

Pipelined Control

Identical set of control points as the single-cycle uarch!!

[Figure: pipelined datapath with control signals. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Page 15

Pipelined uarch

[Figure: 5-stage pipelined datapath. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?

Page 16

Performance Analysis


Page 17

Terminology and Definitions

- CPI: cycles per instruction
- IPC: instructions per cycle, which is 1/CPI
- Execution time of an instruction
  - {CPI} x {clock cycle time}
- Execution time of a program
  - Sum over all instructions of [ {CPI} x {clock cycle time} ]
  - {# of instructions} x {average CPI} x {clock cycle time}

Page 18

Examples

- Remember: execution time of a program
  - Sum over all instructions of [ {CPI} x {clock cycle time} ]
  - {# of instructions} x {average CPI} x {clock cycle time}
- Single-cycle uarch
  - CPI = 1, but clock cycle time is long
- Multi-cycle uarch (with 5 stages)
  - CPI = 5, but clock cycle time is short
- Pipelined uarch (with 5 stages)
  - CPI = 1 (steady state), clock cycle time same as multi-cycle
  - This is the ideal case
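As a concrete comparison, using the per-stage latencies from the example later in this lecture (IF = 200 ps, ID = 100 ps, EX = 200 ps, MEM = 200 ps, WB = 100 ps): the single-cycle uarch needs 800 ps per instruction (the sum of all stage latencies); the multi-cycle uarch needs 5 x 200 ps = 1000 ps per instruction, since every cycle must fit the slowest stage; and the pipelined uarch completes one instruction every 200 ps in steady state.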


Page 19

Pipelining: Discussions


Page 20

Pipelined uarch

[Figure: 5-stage pipelined datapath. Based on original figure from P&H CO&D, copyright 2017 Elsevier. All rights reserved.]

Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?

Page 21

Pipeline Considerations

- How to partition?
- How many stages?

Page 22

Pipeline Partitioning: Resource Requirement

- The goal: no shared resources among different pipeline stages
  - i.e., no resource is used by more than 1 stage
  - Otherwise, we have resource contention, or a structural hazard
- Example: need to be able to fetch instructions (in IF stage) and load data (in MEM stage) at the same time
  - A single memory interface is not sufficient
  - Solution 1: provide two separate interfaces via instruction and data caches
  - Solution 2: ??

Page 23

How Many Pipeline Stages?

- BW (bandwidth), a.k.a. throughput: 1 / cycle time
- Ideally, sequential elements (pipeline registers) do not impose additional delays/cost

[Figure: combinational logic (F,D,E,M,W) with delay T ps gives BW ≈ 1/T; two stages of T/2 ps each, (F,D,E) and (M,W), give BW ≈ 2/T; three stages of T/3 ps each, (F,D), (E,M), (M,W), give BW ≈ 3/T.]

Page 24

How Many Pipeline Stages?

- Non-pipelined version with delay T
  - BW = 1/(T + S), where S = sequential element delay
- k-stage pipelined version
  - BW_k-stage = 1/(T/k + S)
  - BW_max = 1/(1 gate delay + S)

[Figure: non-pipelined logic of T ps vs. a pipeline of stages of T/k ps each, separated by sequential elements.]

Sequential element delay reduces BW (switching overhead between stages)
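To see the effect, plug in hypothetical numbers: with T = 800 ps and S = 50 ps, the non-pipelined BW = 1/850 ps, while a 5-stage pipeline gives BW = 1/(160 ps + 50 ps) = 1/210 ps. That is only about a 4x improvement rather than the ideal 5x, because S does not shrink as k grows.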

Page 25

How Many Pipeline Stages?

- Non-pipelined version with combinational cost G
  - Cost = G + L, where L = sequential element cost
- k-stage pipelined version
  - Cost_k-stage ≈ G + L*k

[Figure: non-pipelined logic of G gates vs. pipeline stages of G/k gates each, separated by sequential elements.]

Sequential elements increase hardware cost.

It is critical to balance the tradeoffs, i.e., how many stages and what is done in each stage.
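With equally hypothetical numbers: if G = 100,000 gates and L = 2,000 gates per set of pipeline registers, a 5-stage pipeline costs about 110,000 gates (10% overhead), while a 20-stage pipeline costs about 140,000 gates (40% overhead); deeper pipelines buy bandwidth at growing hardware cost.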

Page 26

Properties of an Ideal Pipeline

- Goal: increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Repetition of independent operations
  - No dependencies between repeated operations
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Can you implement an ideal pipeline for instruction processing?

Page 27

Instruction Pipeline: Not Ideal

- Identical operations ... NOT!
  ⇒ different instructions → not all need the same stages
  Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)
- Uniform suboperations ... NOT!
  ⇒ different pipeline stages → not the same latency
  Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
- Independent operations ... NOT!
  ⇒ instructions are not independent of each other
  Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)

Page 28

Instruction Pipeline: Not Ideal

- Identical operations ... NOT!
  ⇒ different instructions → not all need the same stages
  Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)
- Examples
  - Add, Branch: no need to go through the MEM stage
  - Others?
- Performance impact?

Page 29

Instruction Pipeline: Not Ideal

- Uniform suboperations ... NOT!
  ⇒ different pipeline stages → not the same latency
  Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

Page 30

Non-Uniform Operations: Laundry Analogy

[Figure: pipelined laundry with non-uniform step times; the slowest step decides throughput. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Page 31

Non-Uniform Operations: Example

[Figure: pipelined datapath with per-stage latencies of 200 ps (IF), 100 ps (ID), 200 ps (EX), 200 ps (MEM), and 100 ps (WB), and pipeline registers IR_D, PC_D, PC_E, A_E, B_E, Imm_E, nPC_M, Aout_M, B_M, MDR_W, Aout_W. Based on original figure from P&H CO&D, copyright 2004 Elsevier. All rights reserved.]

Page 32

Non-Uniform Operations: Example

[Figure: timing comparison. Single-cycle (time axis 200-1800 ps): each instruction takes 800 ps. Pipelined (time axis 200-1400 ps): the clock cycle is 200 ps, set by the slowest stage, so one instruction completes every 200 ps in steady state even though ID and WB need only 100 ps.]

Page 33

Instruction Pipeline: Not Ideal

- Independent operations ... NOT!
  ⇒ instructions are not independent of each other
  Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)

Page 34

Dependencies and Their Types

- Also called "hazards"
- Two types
  - Data dependency
  - Control dependency

Page 35

Data Dependency Handling


Page 36

Data Dependency Types

Flow dependency (Read-after-Write, RAW):
    r3 ← r1 op r2
    r5 ← r3 op r4

Anti dependency (Write-after-Read, WAR):
    r5 ← r3 op r4
    r3 ← r6 op r7

Output dependency (Write-after-Write, WAW):
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7

Page 37

Data Dependency Types

- Flow dependencies always need to be obeyed because they constitute true dependence on a value
- Anti and output dependencies exist due to the limited number of architectural registers
  - They are dependence on a name, not a value
- Anti and output dependences are easier to handle
  - Write to the destination in one stage and in program order
- Flow dependences are more interesting

Page 38

Ways of Handling Flow Dependencies

- Detect and wait until the value is available in the register file
- Detect and forward/bypass data to the dependent instruction
- Detect and eliminate the dependence at the software level
  - No need for the hardware to detect the dependence
- Predict the needed value(s), execute "speculatively", and verify
- Do something else (fine-grained multithreading)
  - No need to detect

Page 39

Flow Dependency Example

- Consider this sequence:

    SUB  X2, X1, X3
    AND  X12, X2, X5
    OR   X13, X2, X6
    ADD  X14, X2, X2
    STUR X15, [X2, #100]

Page 40

Flow Dependency Example

- SUB writes X2 and ADD reads it in the same cycle
- Assume "internal forwarding" in the register file
  - i.e., ADD gets the new X2 value produced by SUB

[Figure: pipeline timing diagram over time for the five-instruction sequence. AND and OR reach ID before SUB's WB, so their X2 reads are marked "?"; ADD's ID overlaps SUB's WB, which is safe given internal forwarding; STUR reads X2 after the write-back completes.]

Page 41

How to Detect Flow Dependencies in HW?

- Instructions IA and IB (where IA comes before IB) have a RAW dependency iff
  - IB (R/I-type, LDUR, or STUR) reads a register written by IA (R/I-type or LDUR), and
  - dist(IA, IB) < dist(ID, WB) = 3

Stage | R/I-Type | LDUR     | STUR    | B
IF    |          |          |         |
ID    | read RF  | read RF  | read RF |
EX    |          |          |         |
MEM   |          |          |         |
WB    | write RF | write RF |         |

Page 42

Flow Dependency Check Logic

- Helper functions
  - Op1(I) and Op2(I) return the 1st and 2nd register operand fields of I, respectively
  - Use_Op1(I) returns true if I requires the 1st register operand and the register is not X31; similarly for Use_Op2(I)
- Flow dependency occurs when
  - (Op1(IR_ID) == dest_EX) && Use_Op1(IR_ID) && RegWrite_EX, or
  - (Op1(IR_ID) == dest_MEM) && Use_Op1(IR_ID) && RegWrite_MEM, or
  - (Op2(IR_ID) == dest_EX) && Use_Op2(IR_ID) && RegWrite_EX, or
  - (Op2(IR_ID) == dest_MEM) && Use_Op2(IR_ID) && RegWrite_MEM
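A minimal C sketch of this check, as it might appear in a cycle-level simulator like the one in the labs; the older_t struct and the helper prototypes are assumptions for illustration, and only the four comparisons mirror the slide:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical view of an older instruction sitting in EX or MEM. */
    typedef struct {
        int  dest;       /* destination register number */
        bool reg_write;  /* RegWrite control signal */
    } older_t;

    /* Assumed decode helpers (defined elsewhere): Op1/Op2 extract the
     * source register fields; Use_Op1/Use_Op2 are true iff the
     * instruction reads that operand and the register is not X31. */
    int  Op1(uint32_t ir);
    int  Op2(uint32_t ir);
    bool Use_Op1(uint32_t ir);
    bool Use_Op2(uint32_t ir);

    /* RAW (flow) dependency between the instruction in ID and the
     * older instructions currently in EX and MEM. */
    bool flow_dependency(uint32_t ir_id, older_t ex, older_t mem)
    {
        bool op1_hazard = Use_Op1(ir_id) &&
            ((Op1(ir_id) == ex.dest  && ex.reg_write) ||
             (Op1(ir_id) == mem.dest && mem.reg_write));
        bool op2_hazard = Use_Op2(ir_id) &&
            ((Op2(ir_id) == ex.dest  && ex.reg_write) ||
             (Op2(ir_id) == mem.dest && mem.reg_write));
        return op1_hazard || op2_hazard;
    }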


Page 43

Resolving Data Dependence

Option 1: Stall the pipeline (i.e., insert "bubbles")

- Stall = make the dependent instruction wait until its source data value is available
  1. stop all up-stream stages
  2. drain all down-stream stages

[Figure: pipeline timing diagrams (t0-t5) for instructions h, i, j, k, l, where i writes rx and j reads it. With dist(i,j) = 1, j would reach ID before i's result is ready; inserting one bubble gives an effective dist(i,j) = 2, and two bubbles give dist(i,j) = 3, at which point j's ID lines up with i's WB and the value is available via internal forwarding.]

Page 44

Resolving Data Dependence

Option 1: Stall the pipeline (i.e., insert "bubbles")

[Figure: pipeline timing diagrams (t0-t5) showing the stall in action: instruction i proceeds normally while two bubbles (nops) occupy the slots behind it; the dependent instruction j repeats IF/ID during the stall cycles and resumes once i's result is available, with the following instruction k delayed in the same way.]

Page 45

How to Stall?

- Prevent update of the PC and IF/ID registers → ensures stalled instructions stay in their stages
  - Use a "write enable" signal for the register
- Force control values in the ID/EX registers to 0 → nop (no-operation)
- It is crucial that the EX, MEM, and WB stages continue to advance normally during stall cycles

[Figure: pipelined datapath with stall control. Based on original figure from P&H CO&D, copyright 2016 Elsevier. All rights reserved.]
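In a cycle-by-cycle C simulator, the same idea might look like the sketch below; cpu_t, the run_* stage functions, and hazard_detected are assumed names, not the lab handout's API:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative pipeline-register layout (not the lab's actual one). */
    typedef struct { uint32_t instr; uint64_t pc; } if_id_t;
    typedef struct { uint64_t a, b; unsigned ctrl; } id_ex_t;

    typedef struct {
        uint64_t pc;
        if_id_t  if_id;
        id_ex_t  id_ex;
        /* ... ex_mem and mem_wb registers ... */
    } cpu_t;

    /* Per-stage functions and the hazard check, assumed to be defined
     * elsewhere in the simulator. */
    void run_wb(cpu_t *c);  void run_mem(cpu_t *c);  void run_ex(cpu_t *c);
    void run_id(cpu_t *c);  void run_if(cpu_t *c);
    bool hazard_detected(const cpu_t *c);    /* e.g., load-use check */

    void cycle(cpu_t *cpu)
    {
        /* Down-stream stages always advance, draining the instructions
         * already past the stall point. */
        run_wb(cpu);
        run_mem(cpu);
        run_ex(cpu);

        if (hazard_detected(cpu)) {
            /* Insert a bubble: zero the control signals headed into EX
             * so the stalled slot behaves as a nop ... */
            memset(&cpu->id_ex, 0, sizeof cpu->id_ex);
            /* ... and leave pc and if_id untouched (write enable = 0),
             * so IF and ID see the same instructions again next cycle. */
        } else {
            run_id(cpu);
            run_if(cpu);   /* updates cpu->pc and cpu->if_id */
        }
    }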

Page 46

Register with Write Control

- Only updates on the clock edge when the write control input is 1
- Used when the stored value is required later

[Figure: register symbol and gated-clock implementation, with inputs D, Clk, and Write and output Q. Based on original figure from P&H CO&D, copyright 2016 Elsevier. All rights reserved.]
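A hypothetical software analogue, as one might model this register in a C simulator:

    #include <stdint.h>

    /* Register with write control: on a clock edge, Q takes D only when
     * Write is 1; otherwise it holds its old value. */
    typedef struct { uint64_t q; } wreg_t;

    static void clock_edge(wreg_t *r, uint64_t d, int write)
    {
        if (write)
            r->q = d;   /* write == 0: value held, which implements a stall */
    }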

Page 47

Impact of Stall on Performance

- Each stall cycle corresponds to one lost cycle in which no instruction can be completed
- For a program with N instructions and S stall cycles, average CPI = (N + S) / N
- S depends on
  - frequency of RAW dependences
  - exact distance between the dependent instructions
  - distance between dependencies
    - Suppose i1, i2, and i3 all depend on i0; once i1's dependence is resolved, i2 and i3 must be okay too
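For example (made-up numbers): a program with N = 1,000,000 instructions that incurs S = 250,000 stall cycles has average CPI = (1,000,000 + 250,000) / 1,000,000 = 1.25, i.e., it runs 25% slower than the ideal pipelined CPI of 1.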


Page 48

Reducing Stalls with Data Forwarding

- Also called data bypassing
- Forward the value to the dependent instruction as soon as it is available

[Figure: pipeline diagram for SUB X2,X1,X3 followed by AND X12,X2,X5; SUB's result is forwarded from the end of its EX stage directly into AND's EX stage, so AND does not wait for SUB's WB.]

Page 49

Resolving Data Dependence

- Option 2: data forwarding / bypassing
- Instructions IA and IB (where IA comes before IB) have a flow dependency
  - i.e., if IB in the ID stage reads a register written by IA in the EX or MEM stage, then the operand required by IB is not yet in the RF
  ⇒ retrieve the operand from the datapath instead of the RF

Page 50

Double Data Dependency

- Consider the sequence:

    ADD X1, X1, X2
    AND X1, X1, X3
    SUB X1, X1, X4

- If data dependencies occur in multiple outstanding instructions, retrieve the operand from the youngest one

[Figure: pipeline diagram for the three instructions. Forwarding SUB's X1 from ADD would be incorrect forwarding; correct forwarding takes X1 from AND, the youngest instruction that writes it.]

Page 51

Datapath with Forwarding

[Figure: pipelined datapath with forwarding unit and ALU-input muxes. Based on original figure from P&H CO&D, copyright 2016 Elsevier. All rights reserved.]

Page 52

Forwarding Paths

[Figure: forwarding paths for dist(i,j) = 1 (from EX/MEM), dist(i,j) = 2 (from MEM/WB), and dist(i,j) = 3 (internal forwarding in the register file). Based on original figure from P&H CO&D, copyright 2016 Elsevier. All rights reserved.]

Page 53

Forwarding Conditions and Logic

Mux control   | Source | Explanation
ForwardA = 00 | ID/EX  | First operand comes from the register file.
ForwardA = 10 | EX/MEM | First operand is forwarded from the ALU result of the previous instruction.
ForwardA = 01 | MEM/WB | First operand is forwarded from data memory or an earlier ALU result.

if (Op1_EX != X31) && (Op1_EX == dest_MEM) && RegWrite_MEM,
    then forward operand from EX/MEM stage      // dist = 1
else if (Op1_EX != X31) && (Op1_EX == dest_WB) && RegWrite_WB,
    then forward operand from MEM/WB stage      // dist = 2
else use operand from register file             // dist >= 3

Ordering matters!! Must check the youngest match first.

Similar for ForwardB.

What does the above not take into account?
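The priority logic above can be written as a small C function; the enum values and parameter names are illustrative (ForwardA itself is the 2-bit mux control from the table):

    #include <stdbool.h>

    /* 2-bit mux control values, matching the table above. */
    enum fwd { FWD_RF = 0 /* 00 */, FWD_MEMWB = 1 /* 01 */, FWD_EXMEM = 2 /* 10 */ };

    enum fwd forward_a(int op1_ex,                        /* Op1 of instr in EX */
                       int dest_mem, bool reg_write_mem,  /* from EX/MEM register */
                       int dest_wb,  bool reg_write_wb)   /* from MEM/WB register */
    {
        /* Ordering matters: check the youngest match first. */
        if (op1_ex != 31 && reg_write_mem && op1_ex == dest_mem)
            return FWD_EXMEM;   /* dist = 1 */
        if (op1_ex != 31 && reg_write_wb && op1_ex == dest_wb)
            return FWD_MEMWB;   /* dist = 2 */
        return FWD_RF;          /* dist >= 3: register file value is current */
    }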

Page 54

Data Forwarding (Dependency Analysis)

- Even with data forwarding, a RAW dependence on an immediately preceding LDUR instruction (called a load-use dependency) requires a stall

Stage | R/I-Type     | LDUR    | STUR         | B
IF    |              |         |              |
ID    |              |         |              |
EX    | use, produce | use     | use, (use)   |
MEM   |              | produce | actually use |
WB    |              |         |              |

Page 55

Load-Use Data Dependency Example

- Consider this sequence:

    LDUR X2, [X1, #20]
    AND  X4, X2, X5
    OR   X8, X2, X6

[Figure: pipeline diagram for the three instructions. Forwarding to AND's EX stage would be incorrect forwarding: the loaded value only becomes available at the end of LDUR's MEM stage, after AND's EX has already begun. OR, one cycle later, can be forwarded to correctly.]

Page 56

Load-Use Data Dependency Example

[Figure: same sequence with a bubble inserted between LDUR and AND; after the one-cycle stall, forwarding from MEM/WB supplies the loaded value to AND's EX stage. Based on original figure from P&H CO&D, copyright 2016 Elsevier. All rights reserved.]

Page 57

Load-Use Dependency Detection

- Load-use dependency occurs when
  - (Opcode_EX == LDUR) and ((dest_EX == Op1_ID) or (dest_EX == Op2_ID)) and (dest_EX != X31)
- If detected, stall and insert a bubble
- What if the instruction following LDUR is STUR?
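A C sketch of the detection condition, with assumed accessors into the ID and EX pipeline registers; LDUR_OPCODE is a placeholder for the actual encoding:

    #include <stdbool.h>
    #include <stdint.h>

    #define LDUR_OPCODE 0x7C2u   /* placeholder; use the real 11-bit encoding */

    /* Assumed accessors into the pipeline registers. */
    uint32_t opcode_ex(void);  int dest_ex(void);
    int op1_id(void);          int op2_id(void);

    /* Load-use hazard: the instruction in EX is a load whose destination
     * is read by the instruction in ID. If true, stall IF/ID for one
     * cycle and inject a bubble into EX (see "How to Stall"). */
    bool load_use_hazard(void)
    {
        return opcode_ex() == LDUR_OPCODE
            && dest_ex() != 31   /* X31 never creates a dependency */
            && (dest_ex() == op1_id() || dest_ex() == op2_id());
    }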

Page 58

Pipeline with Data Dependency Handling

[Figure: full pipelined datapath with both the forwarding unit and the load-use dependency detection unit. Based on original figure from P&H CO&D, copyright 2016 Elsevier. All rights reserved.]

Page 59

More Optimization Opportunities

- Consider this sequence; it requires 1 stall:

    LDUR X2, [X1, #20]
    AND  X4, X2, X5
    OR   X8, X3, X6

- Why not re-order the instructions, as long as program semantics are preserved? Then there is no need to stall:

    LDUR X2, [X1, #20]
    OR   X8, X3, X6
    AND  X4, X2, X5

Page 60

HW vs. SW in Dependency Handling

- Software-based vs. hardware-based interlocking
  - Who inserts/manages the pipeline bubbles?
  - Who finds the independent instructions and re-orders instructions to fill "empty" pipeline slots?
- What are the advantages/disadvantages of each?

Page 61

HW vs. SW in Dependency Handling

- Software-based scheduling of instructions → static scheduling
  - The compiler orders the instructions; the hardware executes them in that order
  - Contrast this with dynamic scheduling (in which hardware can execute instructions out of the compiler-specified order)
  - How does the compiler know the latency of each instruction?
- What information does the compiler not know that makes static scheduling difficult?
  - Answer: anything that is determined at run time
    - Variable-length operation latency, branch direction
- How can the compiler alleviate this (i.e., estimate the unknown)?
  - Answer: profiling