
CMSC 22200 Computer Architecture

Lecture 5: Pipelining: Data Dependency Handling

Prof. Yanjing Li, University of Chicago

Administrative Stuff

2

!  Lab1
   "  Aim to finish grading by Thursday

!  Lab2: due next Thursday (10/20)
   "  You should have started by now
   "  Fix Lab1 first! Use the "golden" simulator

!  And get help!

!  Lab2 and Lab3 late penalty waived
   "  Still need to hand in within 48 hours past the due date

!  Need to reschedule lecture for next Tuesday (10/18)

Lecture Outline

3

!  Pipelining discussions

!  Non-ideal pipeline

!  Dependencies and how to handle them

Single Cycle uarch: Datapath & Control

4  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Putting It All Together: R-Type ALU

5  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: ALUop, 10, 0]

Putting It All Together: LDUR

6  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: ADD, 10, 1]

Putting It All Together: STUR

7  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: ADD, 01, 0]

Putting It All Together: B

8  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: X, 00, 0]

Putting It All Together: CBZ (Taken)

9  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: cmp, 00, 0]

Putting It All Together: CBZ (Not Taken)

10  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: cmp, 00, 0]

Single Cycle uArch: Summary

!  Inefficient
   "  All instructions run as slow as the slowest instruction

!  Not necessarily the simplest way to implement an ISA
   "  Single-cycle implementation of REP MOVS (x86)?

!  Not easy to optimize/improve performance
   "  Optimizing the common case (e.g., common instructions) does not work
   "  Need to optimize the worst case all the time

!  Resources are not fully utilized
   "  e.g., data memory access can't overlap with ALU operation

!  How to do better?

11

Single-Cycle, Multi-Cycle, Pipelining

!  Single-cycle: 1 cycle per instruction, long cycle time

!  Multi-cycle: 5 cycles per instruction, short cycle time

!  Pipeline: 1 cycle per instruction (steady state), short cycle time

12

[Timing diagrams, time on the horizontal axis: single-cycle, multi-cycle, and pipelined execution of instructions through the F, D, E, M, W stages]

Adding Pipeline Registers

!  Registers between stages to hold information produced in the previous cycle

[Pipelined datapath figure with pipeline registers: IR_D, PC_D, PC_E, nPC_M, A_E, B_E, Imm_E, Aout_M, B_M, MDR_W, Aout_W]

13  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Corrected from last lecture: latching PC instead of PC+4

Pipelined Control

Identical set of control points as the single-cycle uarch!!

14  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Pipelined uarch

15  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?

Performance Analysis

16

Terminologies and Definitions

!  CPI: cycles per instruction

!  IPC: instructions per cycle, which is 1/CPI

!  Execution time of an instruction
   "  {CPI} x {clock cycle time}

!  Execution time of a program
   "  Sum over all instructions [ {CPI} x {clock cycle time} ]
   "  {# of instructions} x {average CPI} x {clock cycle time}

17

Examples

!  Remember: execution time of a program
   "  Sum over all instructions [ {CPI} x {clock cycle time} ]
   "  {# of instructions} x {average CPI} x {clock cycle time}

!  Single-cycle uarch
   "  CPI = 1, but clock cycle time is long

!  Multi-cycle uarch (with 5 stages)
   "  CPI = 5, but clock cycle time is short

!  Pipelined uarch (with 5 stages)
   "  CPI = 1 (steady state), clock cycle time same as multi-cycle
   "  This is the ideal case (see the worked example below)

18
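To make the comparison concrete, here is a worked example with assumed numbers (5 stages, an 800 ps single-cycle clock, a 200 ps stage-limited clock as in the latencies used later in this lecture, and N = 100 instructions):

```latex
\begin{align*}
\text{Single-cycle:} \quad & 100 \times 1 \times 800\,\text{ps} = 80{,}000\,\text{ps}\\
\text{Multi-cycle:}  \quad & 100 \times 5 \times 200\,\text{ps} = 100{,}000\,\text{ps}\\
\text{Pipelined:}    \quad & \approx 100 \times 1 \times 200\,\text{ps} = 20{,}000\,\text{ps}
\end{align*}
```

With these assumptions the multi-cycle design is actually slower than the single-cycle one, because every instruction is charged all 5 cycles; the pipelined design wins because it keeps the short cycle time while retiring one instruction per cycle in steady state.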

Pipelining: Discussions

19

Pipelined uarch

20  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?

Pipeline Considerations

!  How to partition?

!  How many stages?

21

Pipeline Partitioning: Resource Requirement

!  The goal: no shared resources among different pipeline stages
   "  i.e., no resource is used by more than 1 stage
   "  Otherwise, we have resource contention or a structural hazard

!  Example: need to be able to fetch instructions (in the IF stage) and load data (in the MEM stage) at the same time
   "  A single memory interface is not sufficient
   "  Solution 1: provide two separate interfaces via instruction and data caches
   "  Solution 2: ??

22

How Many Pipeline Stages?

!  BW (bandwidth), a.k.a. throughput: 1/cycle time

!  Ideally, sequential elements (pipeline registers) do not impose additional delays/cost

23

[Figure: combinational logic (F, D, E, M, W) with total delay T ps and BW ≈ 1/T; splitting it into 2 stages of T/2 ps gives BW ≈ 2/T, and into 3 stages of T/3 ps gives BW ≈ 3/T]

How Many Pipeline Stages?

!  Nonpipelined version with delay T
   BW = 1/(T + S), where S = sequential element delay

!  k-stage pipelined version
   BW_k-stage = 1/(T/k + S)
   BW_max = 1/(1 gate delay + S)

24

[Figure: one block of T ps vs. k stages of T/k ps each]

Sequential element delay reduces BW (switching overhead b/w stages)

How Many Pipeline Stages?

!  Nonpipelined version with combinational cost G
   Cost = G + L, where L = sequential element cost

!  k-stage pipelined version
   Cost_k-stage ≈ G + L·k

25

[Figure: one block of G gates vs. k stages of G/k gates each]

Sequential elements increase hardware cost

It is critical to balance the tradeoffs, i.e., how many stages and what is done in each stage (see the sketch below)
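The sketch below plugs assumed numbers (not from the slides) into the two formulas above to show the throughput/cost tradeoff as the stage count k grows:

```c
/* Sweep the number of pipeline stages k and print throughput
 * BW = 1/(T/k + S) and cost = G + L*k.  All constants are assumed
 * for illustration only. */
#include <stdio.h>

int main(void)
{
    const double T = 800.0;   /* total combinational delay, ps (assumed) */
    const double S = 50.0;    /* pipeline register delay, ps (assumed)   */
    const double G = 10000.0; /* combinational cost in gates (assumed)   */
    const double L = 400.0;   /* cost per set of pipeline registers (assumed) */

    for (int k = 1; k <= 8; k++) {
        double cycle = T / k + S;    /* cycle time in ps                   */
        double bw    = 1.0 / cycle;  /* instructions per ps (steady state) */
        double cost  = G + L * k;    /* gates                              */
        printf("k=%d  cycle=%6.1f ps  BW=%.5f /ps  cost=%.0f gates\n",
               k, cycle, bw, cost);
    }
    return 0;
}
```

Throughput improves with diminishing returns (the register delay S eventually dominates), while cost grows linearly in k, which is exactly the balancing act described above.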

Properties of An Ideal Pipeline

!  Goal: Increase throughput with little increase in cost (hardware cost, in the case of instruction processing)

!  Repetition of identical operations
   "  The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)

!  Repetition of independent operations
   "  No dependencies between repeated operations

!  Uniformly partitionable suboperations
   "  Processing can be evenly divided into uniform-latency suboperations (that do not share resources)

!  Can you implement an ideal pipeline for instruction processing?

26

Instruction Pipeline: Not Ideal

!  Identical operations ... NOT!
   ⇒ different instructions → not all need the same stages
   Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)

!  Uniform suboperations ... NOT!
   ⇒ different pipeline stages → not the same latency
   Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

!  Independent operations ... NOT!
   ⇒ instructions are not independent of each other
   Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)

27

Instruction Pipeline: Not Ideal

!  Identical operations ... NOT!
   ⇒ different instructions → not all need the same stages
   Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)

!  Examples
   "  Add, Branch: no need to go through the MEM stage
   "  Others?

!  Performance impact?

28

Instruction Pipeline: Not Ideal

!  Uniform suboperations ... NOT!
   ⇒ different pipeline stages → not the same latency
   Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

29

Non-Uniform Operations: Laundry Analogy

30

the slowest step decides throughput

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Non-Uniform Operations: Example

31

[Figure: stage latencies of 200 ps, 100 ps, 200 ps, 200 ps, 100 ps]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Non-Uniform Operations: Example

32

[Pipelined datapath figure with pipeline registers IR_D, PC_D, PC_E, nPC_M, A_E, B_E, Imm_E, Aout_M, B_M, MDR_W, Aout_W]

[Timing diagram: nonpipelined execution takes 800 ps per instruction, while the pipelined version advances every 200 ps, limited by the slowest stage]

Instruction Pipeline: Not Ideal

!  Independent operations ... NOT!
   ⇒ instructions are not independent of each other
   Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)

33

Dependencies and Their Types

!  Also called "hazards"

!  Two types "  Data dependency "  Control dependency

34

Data Dependency Handling

35

Data Dependency Types

36

Flow dependency (Read-after-Write, RAW):
   r3 ← r1 op r2
   r5 ← r3 op r4

Anti dependency (Write-after-Read, WAR):
   r5 ← r3 op r4
   r3 ← r6 op r7

Output dependency (Write-after-Write, WAW):
   r3 ← r1 op r2
   r5 ← r3 op r4
   r3 ← r6 op r7
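A tiny sketch of these definitions in C, using register fields only (the struct and names are illustrative, not from the lab code):

```c
/* Classify the dependency between an earlier instruction i and a later
 * instruction j, each of the form  dest <- src1 op src2. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t dest, src1, src2; } inst_t;

bool raw(inst_t i, inst_t j) { return j.src1 == i.dest || j.src2 == i.dest; } /* flow   */
bool war(inst_t i, inst_t j) { return j.dest == i.src1 || j.dest == i.src2; } /* anti   */
bool waw(inst_t i, inst_t j) { return j.dest == i.dest; }                     /* output */

int main(void)
{
    inst_t i = { .dest = 3, .src1 = 1, .src2 = 2 };  /* r3 <- r1 op r2 */
    inst_t j = { .dest = 5, .src1 = 3, .src2 = 4 };  /* r5 <- r3 op r4 */
    printf("RAW=%d WAR=%d WAW=%d\n", raw(i, j), war(i, j), waw(i, j));  /* prints RAW=1 WAR=0 WAW=0 */
    return 0;
}
```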

Data Dependency Types

!  Flow dependencies always need to be obeyed because they constitute true dependence on a value

!  Anti and output dependencies exist due to the limited number of architectural registers
   "  They are dependence on a name, not a value

!  Anti and output dependences are easier to handle
   "  Write to the destination in one stage and in program order

!  Flow dependences are more interesting

37

Ways of Handling Flow Dependencies

!  Detect and wait until the value is available in the register file

!  Detect and forward/bypass data to the dependent instruction

!  Detect and eliminate the dependence at the software level
   "  No need for the hardware to detect dependence

!  Predict the needed value(s), execute "speculatively", and verify

!  Do something else (fine-grained multithreading)
   "  No need to detect

38

Flow Dependency Example

!  Consider this sequence:
   SUB X2, X1, X3
   AND X12, X2, X5
   OR X13, X2, X6
   ADD X14, X2, X2
   STUR X15, [X2, #100]

!  SUB writes X2 and ADD reads it in the same cycle

!  Assume "internal forwarding" in the register file
   "  i.e., ADD gets the new X2 value produced by SUB

40

Flow Dependency Example

[Pipeline diagram, time on the horizontal axis: the five instructions above flow through IF/ID/EX/MEM/WB; which of the dependent instructions read X2 before SUB writes it back?]

How to Detect Flow Dependencies in HW?

!  Instructions IA and IB (where IA comes before IB) have a RAW dependency iff
   "  IB (R/I, LDUR, or STUR) reads a register written by IA (R/I or LDUR), and
   "  dist(IA, IB) < dist(ID, WB) = 3

41

Stage | R/I-Type  | LDUR     | STUR     | B
IF    |           |          |          |
ID    | read RF   | read RF  | read RF  |
EX    |           |          |          |
MEM   |           |          |          |
WB    | write RF  | write RF |          |

Flow Dependency Check Logic

!  Helper functions
   "  Op1(I) and Op2(I) return the 1st and 2nd register operand fields of I, respectively
   "  Use_Op1(I) returns true if I requires the 1st register operand and the register is not X31; similarly for Use_Op2(I)

!  A flow dependency occurs when (see the C sketch below)
   "  (Op1(IRID) == destEX) && use_Op1(IRID) && RegWriteEX, or
   "  (Op1(IRID) == destMEM) && use_Op1(IRID) && RegWriteMEM, or
   "  (Op2(IRID) == destEX) && use_Op2(IRID) && RegWriteEX, or
   "  (Op2(IRID) == destMEM) && use_Op2(IRID) && RegWriteMEM

42
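A minimal C sketch of this check, assuming a hypothetical decoded-instruction struct (field and function names are illustrative, not the course simulator's actual API):

```c
#include <stdbool.h>
#include <stdint.h>

#define XZR 31  /* X31 reads as zero and never creates a dependency */

typedef struct {
    uint32_t op1, op2;      /* 1st and 2nd source register fields */
    bool use_op1, use_op2;  /* does the instruction actually read them? */
    uint32_t dest;          /* destination register field */
    bool reg_write;         /* does this instruction write the register file? */
} decoded_inst_t;

/* True if the instruction currently in ID has a flow (RAW) dependency on the
 * instruction in EX or the instruction in MEM. */
bool flow_dependency(const decoded_inst_t *id,
                     const decoded_inst_t *ex,
                     const decoded_inst_t *mem)
{
    bool dep_op1 = id->use_op1 && id->op1 != XZR &&
                   ((ex->reg_write  && id->op1 == ex->dest) ||
                    (mem->reg_write && id->op1 == mem->dest));
    bool dep_op2 = id->use_op2 && id->op2 != XZR &&
                   ((ex->reg_write  && id->op2 == ex->dest) ||
                    (mem->reg_write && id->op2 == mem->dest));
    return dep_op1 || dep_op2;
}
```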

Resolving Data Dependence

43

Option 1: Stall the pipeline (i.e., inserting "bubbles")

Stall = make the dependent instruction wait until its source data value is available
   1. stop all up-stream stages
   2. drain all down-stream stages

[Pipeline diagrams: instruction i (rx ← _) followed by a dependent instruction j (_ ← rx) at dist(i,j) = 1, 2, and 3, with bubbles inserted between them until j can read the correct value]

Resolving Data Dependence

44

Option 1: Stall the pipeline (i.e., inserting "bubbles")

[Pipeline diagram: Inst_i, Bubble (nop), Bubble (nop), Inst_j; the bubbles hold Inst_j back so that it reads its source after Inst_i's write-back]

How to Stall?

!  Prevent update of the PC and the IF/ID registers → ensure stalled instructions stay in their stages (sketched in C below)
   "  Use a "write enable" signal for these registers

!  Force control values in the ID/EX registers to 0 → nop (no-operation)

!  It is crucial that the EX, MEM and WB stages continue to advance normally during stall cycles

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
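A minimal sketch of stalling in a cycle-based simulator, assuming hypothetical pipeline-latch structs (this is not the lab's actual code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t pc; uint32_t instr; } if_id_t;
typedef struct { bool reg_write, mem_read, mem_write; int alu_op; /* data fields omitted */ } id_ex_t;

typedef struct {
    uint64_t pc;
    if_id_t  if_id;
    id_ex_t  id_ex;
    /* ex_mem and mem_wb latches omitted for brevity */
} pipeline_t;

void advance_cycle(pipeline_t *p, bool stall)
{
    /* WB, MEM, and EX always advance normally (the pipeline drains) -- not shown. */

    if (stall) {
        /* 1. Hold the PC and IF/ID latch: simply do not overwrite them
         *    (equivalent to de-asserting their write-enable signals). */
        /* 2. Insert a bubble: zero the ID/EX control signals -> nop.  */
        memset(&p->id_ex, 0, sizeof p->id_ex);
        return;
    }

    /* Normal operation: latch new values into ID/EX and IF/ID, then update
     * the PC (details omitted). */
}
```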

Register with Write Control

!  Only updates on a clock edge when the write control input is 1
!  Used when the stored value is required later

[Figure: register with data input D, clock input Clk, write-enable input Write, and output Q]

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Impact of Stall on Performance

!  Each stall cycle corresponds to one lost cycle in which no instruction can be completed

!  For a program with N instructions and S stall cycles, Average CPI = (N + S)/N (worked example below)

!  S depends on
   "  frequency of RAW dependences
   "  exact distance between the dependent instructions
   "  distance between dependencies: suppose i1, i2, and i3 all depend on i0; once i1's dependence is resolved, i2 and i3 must be okay too

47
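A quick worked example with assumed numbers (not from the slides):

```latex
% Assume N = 1000 instructions and 20% of them incur one stall cycle each,
% so S = 200 stall cycles.
\[
\text{Average CPI} = \frac{N + S}{N} = \frac{1000 + 200}{1000} = 1.2
\qquad\Rightarrow\qquad
\text{IPC} = \frac{1}{1.2} \approx 0.83
\]
```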

Reducing Stalls with Data Forwarding

!  Also called Data Bypassing

!  Forward the value to the dependent instruction as soon as it is available

48

[Pipeline diagram: SUB X2,X1,X3 followed by AND X12,X2,X5; the SUB result is forwarded from the end of its EX stage directly to the EX stage of AND]

Resolving Data Dependence

!  Option 2: data forwarding / bypassing

!  Instructions IA and IB (where IA comes before IB) have a flow dependency, i.e., if IB in the ID stage reads a register written by IA in the EX or MEM stage, then the operand required by IB is not yet in the RF
   ⇒ retrieve the operand from the datapath instead of the RF

49

Double Data Dependency

!  Consider the sequence:
   ADD X1, X1, X2
   AND X1, X1, X3
   SUB X1, X1, X4

!  Retrieve the operand from the younger (most recent) instruction if data dependencies occur with multiple outstanding instructions

[Pipeline diagram: SUB X1,X1,X4 must receive X1 forwarded from AND X1,X1,X3 (correct forwarding), not from ADD X1,X1,X2 (incorrect forwarding)]

Datapath with Forwarding

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Forwarding Paths

[Figure: forwarding paths for dist(i,j) = 1, dist(i,j) = 2, and dist(i,j) = 3]

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Forwarding Conditions and Logic

Mux control   | Source  | Explanation
ForwardA = 00 | ID/EX   | First operand comes from the register file.
ForwardA = 10 | EX/MEM  | First operand is forwarded from the ALU result of the previous instruction.
ForwardA = 01 | MEM/WB  | First operand is forwarded from data memory or an earlier ALU result.

if (Op1EX != X31) && (Op1EX == destMEM) && RegWriteMEM
    then forward operand from the EX/MEM stage    // dist = 1
else if (Op1EX != X31) && (Op1EX == destWB) && RegWriteWB
    then forward operand from the MEM/WB stage    // dist = 2
else
    use the operand from the register file        // dist >= 3

Ordering matters!! Must check the youngest match first (see the C sketch below)

What does the above not take into account?

Similar for ForwardB
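Here is the same selection written out as a small C function, with hypothetical struct/field names (ForwardB is symmetric for the second operand):

```c
#include <stdbool.h>
#include <stdint.h>

#define XZR 31

/* Mux encodings from the table above: 00 = register file,
 * 10 = EX/MEM ALU result, 01 = MEM/WB value. */
enum forward_sel { FWD_RF = 0, FWD_MEM_WB = 1, FWD_EX_MEM = 2 };

typedef struct { uint32_t dest; bool reg_write; } latch_info_t;

/* op1_ex: first source register of the instruction currently in EX. */
enum forward_sel forward_a(uint32_t op1_ex,
                           const latch_info_t *ex_mem,
                           const latch_info_t *mem_wb)
{
    if (op1_ex == XZR)
        return FWD_RF;
    /* Ordering matters: check the youngest producer (EX/MEM) first. */
    if (ex_mem->reg_write && ex_mem->dest == op1_ex)
        return FWD_EX_MEM;                  /* dist = 1 */
    if (mem_wb->reg_write && mem_wb->dest == op1_ex)
        return FWD_MEM_WB;                  /* dist = 2 */
    return FWD_RF;                          /* dist >= 3: value already in the RF */
}
```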

Data Forwarding (Dependency Analysis)

!  Even with data forwarding, a RAW dependence on an immediately preceding LDUR instruction cannot be satisfied in time; this is called a load-use dependency

!  Requires a stall

54

Stage | R/I-Type     | LDUR    | STUR         | B
IF    |              |         |              |
ID    |              |         |              |
EX    | use, produce | use     | (use)        |
MEM   |              | produce | actually use |
WB    |              |         |              |

Load-Use Data Dependency Example

!  Consider this sequence:
   LDUR X2, [X1, #20]
   AND X4, X2, X5
   OR X8, X2, X6

[Pipeline diagram: LDUR X2,[X1,#20] produces X2 only at the end of its MEM stage, but AND X4,X2,X5 needs it at the start of its EX stage one cycle earlier, so forwarding alone cannot help (incorrect forwarding)]

Load-Use Data Dependency Example

Bubble inserted here

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Load-Use Dependency Detection

!  Load-use dependency when (see the C sketch below)
   "  (OpcodeEX == LDUR) and
   "  ((destEX == Op1ID) or (destEX == Op2ID)) and (destEX != X31)

!  If detected, stall and insert bubble

!  What if the instruction following LDUR is STUR?
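A minimal C sketch of this detection, using a decoded-instruction struct similar to the earlier sketch (the opcode enum is illustrative, not the simulator's actual encoding):

```c
#include <stdbool.h>
#include <stdint.h>

#define XZR 31

enum opcode { OP_RTYPE, OP_LDUR, OP_STUR, OP_B /* ... */ };

typedef struct {
    enum opcode op;
    uint32_t dest;          /* destination register */
    uint32_t op1, op2;      /* source register fields */
    bool use_op1, use_op2;  /* does the instruction actually read them? */
} decoded_inst_t;

/* True if the instruction in ID needs the register that the LDUR in EX is
 * about to load -> stall one cycle and insert a bubble. */
bool load_use_hazard(const decoded_inst_t *id, const decoded_inst_t *ex)
{
    if (ex->op != OP_LDUR || ex->dest == XZR)
        return false;
    return (id->use_op1 && id->op1 == ex->dest) ||
           (id->use_op2 && id->op2 == ex->dest);
}
```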

Pipeline with Data Dependency Handling

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Load-use dependency detection

More Optimization Opportunities

59

!  Consider this sequence (requires 1 stall):
   LDUR X2, [X1, #20]
   AND X4, X2, X5
   OR X8, X3, X6

!  Why not re-order the instructions, as long as program semantics are preserved? Then there is no need to stall:
   LDUR X2, [X1, #20]
   OR X8, X3, X6
   AND X4, X2, X5

HW vs. SW in Dependency Handling

!  Software-based vs. hardware-based interlocking
   "  Who inserts/manages the pipeline bubbles?
   "  Who finds the independent instructions and re-orders instructions to fill "empty" pipeline slots?

!  What are the advantages/disadvantages of each?

60

HW vs. SW in Dependency Handling

!  Software-based scheduling of instructions → static scheduling
   "  The compiler orders the instructions; the hardware executes them in that order
   "  Contrast this with dynamic scheduling (in which hardware can execute instructions out of the compiler-specified order)
   "  How does the compiler know the latency of each instruction?

!  What information does the compiler not know that makes static scheduling difficult?
   "  Answer: Anything that is determined at run time
      !  Variable-length operation latency, branch direction

!  How can the compiler alleviate this (i.e., estimate the unknown)?
   "  Answer: Profiling

61
