
CMSC 22200 Computer Architecture

Lecture 5: Pipelining: Data Dependency Handling

Prof. Yanjing Li, University of Chicago

Administrative Stuff

2

!  Lab1
   "  Aim to finish grading by Thursday

!  Lab2: due next Thursday (10/20)
   "  You should have started by now
   "  Fix Lab1 first! Use the "golden" simulator

!  And get help!

!  Lab2 and Lab3 late penalty waived
   "  Still need to hand in within 48 hours past the due date

!  Need to reschedule lecture for next Tuesday (10/18)

Lecture Outline

3

!  Pipelining discussions

!  Non-ideal pipeline

!  Dependencies and how to handle them

Single Cycle uarch: Datapath & Control

4  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Putting It All Together: R-Type ALU

5  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: ALUop, 10, 0]

Putting It All Together: LDUR

6  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: ADD, 10, 1]

Putting It All Together: STUR

7  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: ADD, 01, 0]

Putting It All Together: B

8  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: X, 00, 0]

Putting It All Together: CBZ (Taken)

9  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: cmp, 00, 0]

Putting It All Together: CBZ (Not Taken)

10  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

[Single-cycle datapath figure; control annotations shown on the slide: cmp, 00, 0]

Single Cycle uArch: Summary

!  Inefficient
   "  All instructions run as slow as the slowest instruction

!  Not necessarily the simplest way to implement an ISA
   "  Single-cycle implementation of REP MOVS (x86)?

!  Not easy to optimize/improve performance
   "  Optimizing the common case (e.g., common instructions) does not work
   "  Need to optimize the worst case all the time

!  Resources are not fully utilized
   "  e.g., data memory access can't overlap with ALU operation

!  How to do better?

11

Single-Cycle, Multi-Cycle, Pipelining

!  Single-cycle: 1 cycle per instruction, long cycle time

!  Multi-cycle: 5 cycles per instruction, short cycle time

!  Pipeline: 1 cycle per instruction (steady state), short cycle time

12

[Timing diagrams, time on the horizontal axis: single-cycle, multi-cycle, and pipelined execution of instructions through the F, D, E, M, W stages]

Adding Pipeline Registers

!  Registers between stages to hold information produced in the previous cycle

[Pipelined datapath figure with pipeline registers: IR_D, PC_D, PC_E, nPC_M, A_E, B_E, Imm_E, Aout_M, B_M, MDR_W, Aout_W]

13  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Corrected from last lecture: latching PC instead of PC+4

Pipelined Control

Identical set of control points as the single-cycle uarch!!

14  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Pipelined uarch

15  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?

Performance Analysis

16

Terminologies and Definitions

!  CPI: cycles per instruction

!  IPC: instructions per cycle, which is 1/CPI

!  Execution time of an instruction
   "  {CPI} x {clock cycle time}

!  Execution time of a program
   "  Sum over all instructions [ {CPI} x {clock cycle time} ]
   "  {# of instructions} x {average CPI} x {clock cycle time}

17

Examples

!  Remember: execution time of a program
   "  Sum over all instructions [ {CPI} x {clock cycle time} ]
   "  {# of instructions} x {average CPI} x {clock cycle time}

!  Single-cycle uarch
   "  CPI = 1, but clock cycle time is long

!  Multi-cycle uarch (with 5 stages)
   "  CPI = 5, but clock cycle time is short

!  Pipelined uarch (with 5 stages)
   "  CPI = 1 (steady state), clock cycle time same as multi-cycle
   "  This is the ideal case (see the worked example below)

18
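To make the comparison concrete, here is a worked example with assumed numbers (5 stages, an 800 ps single-cycle clock, a 200 ps stage-limited clock as in the latencies used later in this lecture, and N = 100 instructions):

```latex
\begin{align*}
\text{Single-cycle:} \quad & 100 \times 1 \times 800\,\text{ps} = 80{,}000\,\text{ps}\\
\text{Multi-cycle:}  \quad & 100 \times 5 \times 200\,\text{ps} = 100{,}000\,\text{ps}\\
\text{Pipelined:}    \quad & \approx 100 \times 1 \times 200\,\text{ps} = 20{,}000\,\text{ps}
\end{align*}
```

With these assumptions the multi-cycle design is actually slower than the single-cycle one, because every instruction is charged all 5 cycles; the pipelined design wins because it keeps the short cycle time while retiring one instruction per cycle in steady state.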

Pipelining: Discussions

19

Pipelined uarch

20  **Based on original figure from [P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]

Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?

Pipeline Considerations

!  How to partition?

!  How many stages?

21

Pipeline Partitioning: Resource Requirement

!  The goal: no shared resources among different pipeline stages
   "  i.e., no resource is used by more than 1 stage
   "  Otherwise, we have resource contention or a structural hazard

!  Example: need to be able to fetch instructions (in the IF stage) and load data (in the MEM stage) at the same time
   "  A single memory interface is not sufficient
   "  Solution 1: provide two separate interfaces via instruction and data caches
   "  Solution 2: ??

22

How Many Pipeline Stages?

!  BW (bandwidth), a.k.a. throughput: 1/cycle time

!  Ideally, sequential elements (pipeline registers) do not impose additional delays/cost

23

[Figure: combinational logic (F, D, E, M, W) with total delay T ps and BW ≈ 1/T; splitting it into 2 stages of T/2 ps gives BW ≈ 2/T, and into 3 stages of T/3 ps gives BW ≈ 3/T]

How Many Pipeline Stages?

!  Nonpipelined version with delay T
   BW = 1/(T + S), where S = sequential element delay

!  k-stage pipelined version
   BW_k-stage = 1/(T/k + S)
   BW_max = 1/(1 gate delay + S)

24

[Figure: one block of T ps vs. k stages of T/k ps each]

Sequential element delay reduces BW (switching overhead b/w stages)

How Many Pipeline Stages?

!  Nonpipelined version with combinational cost G
   Cost = G + L, where L = sequential element cost

!  k-stage pipelined version
   Cost_k-stage ≈ G + L·k

25

[Figure: one block of G gates vs. k stages of G/k gates each]

Sequential elements increase hardware cost

It is critical to balance the tradeoffs, i.e., how many stages and what is done in each stage (see the sketch below)
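The sketch below plugs assumed numbers (not from the slides) into the two formulas above to show the throughput/cost tradeoff as the stage count k grows:

```c
/* Sweep the number of pipeline stages k and print throughput
 * BW = 1/(T/k + S) and cost = G + L*k.  All constants are assumed
 * for illustration only. */
#include <stdio.h>

int main(void)
{
    const double T = 800.0;   /* total combinational delay, ps (assumed) */
    const double S = 50.0;    /* pipeline register delay, ps (assumed)   */
    const double G = 10000.0; /* combinational cost in gates (assumed)   */
    const double L = 400.0;   /* cost per set of pipeline registers (assumed) */

    for (int k = 1; k <= 8; k++) {
        double cycle = T / k + S;    /* cycle time in ps                   */
        double bw    = 1.0 / cycle;  /* instructions per ps (steady state) */
        double cost  = G + L * k;    /* gates                              */
        printf("k=%d  cycle=%6.1f ps  BW=%.5f /ps  cost=%.0f gates\n",
               k, cycle, bw, cost);
    }
    return 0;
}
```

Throughput improves with diminishing returns (the register delay S eventually dominates), while cost grows linearly in k, which is exactly the balancing act described above.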

Properties of An Ideal Pipeline

!  Goal: Increase throughput with little increase in cost (hardware cost, in the case of instruction processing)

!  Repetition of identical operations
   "  The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)

!  Repetition of independent operations
   "  No dependencies between repeated operations

!  Uniformly partitionable suboperations
   "  Processing can be evenly divided into uniform-latency suboperations (that do not share resources)

!  Can you implement an ideal pipeline for instruction processing?

26

Instruction Pipeline: Not Ideal

!  Identical operations ... NOT!
   ⇒ different instructions → not all need the same stages
   Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)

!  Uniform suboperations ... NOT!
   ⇒ different pipeline stages → not the same latency
   Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

!  Independent operations ... NOT!
   ⇒ instructions are not independent of each other
   Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)

27

Instruction Pipeline: Not Ideal

!  Identical operations ... NOT!
   ⇒ different instructions → not all need the same stages
   Forcing different instructions to go through the same pipe stages → external fragmentation (some pipe stages idle for some instructions)

!  Examples
   "  Add, Branch: no need to go through the MEM stage
   "  Others?

!  Performance impact?

28

Instruction Pipeline: Not Ideal

!  Uniform suboperations ... NOT!
   ⇒ different pipeline stages → not the same latency
   Need to force each stage to be controlled by the same clock → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)

29

Non-Uniform Operations: Laundry Analogy

30

the slowest step decides throughput

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Non-Uniform Operations: Example

31

[Figure: stage latencies of 200 ps, 100 ps, 200 ps, 200 ps, 100 ps]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Non-Uniform Operations: Example

32

[Pipelined datapath figure with pipeline registers IR_D, PC_D, PC_E, nPC_M, A_E, B_E, Imm_E, Aout_M, B_M, MDR_W, Aout_W]

[Timing diagram: nonpipelined execution takes 800 ps per instruction, while the pipelined version advances every 200 ps, limited by the slowest stage]

Instruction Pipeline: Not Ideal

!  Independent operations ... NOT!
   ⇒ instructions are not independent of each other
   Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results → pipeline stalls (pipeline is not always moving)

33

Dependencies and Their Types

!  Also called "hazards"

!  Two types "  Data dependency "  Control dependency

34

Data Dependency Handling

35

Data Dependency Types

36

Flow dependency (Read-after-Write, RAW):
   r3 ← r1 op r2
   r5 ← r3 op r4

Anti dependency (Write-after-Read, WAR):
   r5 ← r3 op r4
   r3 ← r6 op r7

Output dependency (Write-after-Write, WAW):
   r3 ← r1 op r2
   r5 ← r3 op r4
   r3 ← r6 op r7
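A tiny sketch of these definitions in C, using register fields only (the struct and names are illustrative, not from the lab code):

```c
/* Classify the dependency between an earlier instruction i and a later
 * instruction j, each of the form  dest <- src1 op src2. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t dest, src1, src2; } inst_t;

bool raw(inst_t i, inst_t j) { return j.src1 == i.dest || j.src2 == i.dest; } /* flow   */
bool war(inst_t i, inst_t j) { return j.dest == i.src1 || j.dest == i.src2; } /* anti   */
bool waw(inst_t i, inst_t j) { return j.dest == i.dest; }                     /* output */

int main(void)
{
    inst_t i = { .dest = 3, .src1 = 1, .src2 = 2 };  /* r3 <- r1 op r2 */
    inst_t j = { .dest = 5, .src1 = 3, .src2 = 4 };  /* r5 <- r3 op r4 */
    printf("RAW=%d WAR=%d WAW=%d\n", raw(i, j), war(i, j), waw(i, j));  /* prints RAW=1 WAR=0 WAW=0 */
    return 0;
}
```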

Data Dependency Types

!  Flow dependencies always need to be obeyed because they constitute true dependence on a value

!  Anti and output dependencies exist due to the limited number of architectural registers
   "  They are dependence on a name, not a value

!  Anti and output dependences are easier to handle
   "  Write to the destination in one stage and in program order

!  Flow dependences are more interesting

37

Ways of Handling Flow Dependencies

!  Detect and wait until the value is available in the register file

!  Detect and forward/bypass data to the dependent instruction

!  Detect and eliminate the dependence at the software level
   "  No need for the hardware to detect dependence

!  Predict the needed value(s), execute "speculatively", and verify

!  Do something else (fine-grained multithreading)
   "  No need to detect

38

Flow Dependency Example

!  Consider this sequence:
   SUB X2, X1, X3
   AND X12, X2, X5
   OR X13, X2, X6
   ADD X14, X2, X2
   STUR X15, [X2, #100]

!  SUB writes X2 and ADD reads it in the same cycle

!  Assume "internal forwarding" in the register file
   "  i.e., ADD gets the new X2 value produced by SUB

40

Flow Dependency Example

[Pipeline diagram, time on the horizontal axis: the five instructions above flow through IF/ID/EX/MEM/WB; which of the dependent instructions read X2 before SUB writes it back?]

How to Detect Flow Dependencies in HW?

!  Instructions IA and IB (where IA comes before IB) have a RAW dependency iff
   "  IB (R/I, LDUR, or STUR) reads a register written by IA (R/I or LDUR), and
   "  dist(IA, IB) < dist(ID, WB) = 3

41

Stage | R/I-Type  | LDUR     | STUR     | B
IF    |           |          |          |
ID    | read RF   | read RF  | read RF  |
EX    |           |          |          |
MEM   |           |          |          |
WB    | write RF  | write RF |          |

Flow Dependency Check Logic

!  Helper functions
   "  Op1(I) and Op2(I) return the 1st and 2nd register operand fields of I, respectively
   "  Use_Op1(I) returns true if I requires the 1st register operand and the register is not X31; similarly for Use_Op2(I)

!  A flow dependency occurs when (see the C sketch below)
   "  (Op1(IRID) == destEX) && use_Op1(IRID) && RegWriteEX, or
   "  (Op1(IRID) == destMEM) && use_Op1(IRID) && RegWriteMEM, or
   "  (Op2(IRID) == destEX) && use_Op2(IRID) && RegWriteEX, or
   "  (Op2(IRID) == destMEM) && use_Op2(IRID) && RegWriteMEM

42
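A minimal C sketch of this check, assuming a hypothetical decoded-instruction struct (field and function names are illustrative, not the course simulator's actual API):

```c
#include <stdbool.h>
#include <stdint.h>

#define XZR 31  /* X31 reads as zero and never creates a dependency */

typedef struct {
    uint32_t op1, op2;      /* 1st and 2nd source register fields */
    bool use_op1, use_op2;  /* does the instruction actually read them? */
    uint32_t dest;          /* destination register field */
    bool reg_write;         /* does this instruction write the register file? */
} decoded_inst_t;

/* True if the instruction currently in ID has a flow (RAW) dependency on the
 * instruction in EX or the instruction in MEM. */
bool flow_dependency(const decoded_inst_t *id,
                     const decoded_inst_t *ex,
                     const decoded_inst_t *mem)
{
    bool dep_op1 = id->use_op1 && id->op1 != XZR &&
                   ((ex->reg_write  && id->op1 == ex->dest) ||
                    (mem->reg_write && id->op1 == mem->dest));
    bool dep_op2 = id->use_op2 && id->op2 != XZR &&
                   ((ex->reg_write  && id->op2 == ex->dest) ||
                    (mem->reg_write && id->op2 == mem->dest));
    return dep_op1 || dep_op2;
}
```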

Resolving Data Dependence

43

Option 1: Stall the pipeline (i.e., inserting "bubbles")

Stall = make the dependent instruction wait until its source data value is available
   1. stop all up-stream stages
   2. drain all down-stream stages

[Pipeline diagrams: instruction i (rx ← _) followed by a dependent instruction j (_ ← rx) at dist(i,j) = 1, 2, and 3, with bubbles inserted between them until j can read the correct value]

Resolving Data Dependence

44

Option 1: Stall the pipeline (i.e., inserting "bubbles")

[Pipeline diagram: Inst_i, Bubble (nop), Bubble (nop), Inst_j; the bubbles hold Inst_j back so that it reads its source after Inst_i's write-back]

How to Stall?

!  Prevent update of the PC and the IF/ID registers → ensure stalled instructions stay in their stages (sketched in C below)
   "  Use a "write enable" signal for these registers

!  Force control values in the ID/EX registers to 0 → nop (no-operation)

!  It is crucial that the EX, MEM and WB stages continue to advance normally during stall cycles

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
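A minimal sketch of stalling in a cycle-based simulator, assuming hypothetical pipeline-latch structs (this is not the lab's actual code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t pc; uint32_t instr; } if_id_t;
typedef struct { bool reg_write, mem_read, mem_write; int alu_op; /* data fields omitted */ } id_ex_t;

typedef struct {
    uint64_t pc;
    if_id_t  if_id;
    id_ex_t  id_ex;
    /* ex_mem and mem_wb latches omitted for brevity */
} pipeline_t;

void advance_cycle(pipeline_t *p, bool stall)
{
    /* WB, MEM, and EX always advance normally (the pipeline drains) -- not shown. */

    if (stall) {
        /* 1. Hold the PC and IF/ID latch: simply do not overwrite them
         *    (equivalent to de-asserting their write-enable signals). */
        /* 2. Insert a bubble: zero the ID/EX control signals -> nop.  */
        memset(&p->id_ex, 0, sizeof p->id_ex);
        return;
    }

    /* Normal operation: latch new values into ID/EX and IF/ID, then update
     * the PC (details omitted). */
}
```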

Register with Write Control

!  Only updates on a clock edge when the write control input is 1
!  Used when the stored value is required later

[Figure: register with data input D, clock input Clk, write-enable input Write, and output Q]

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Impact of Stall on Performance

!  Each stall cycle corresponds to one lost cycle in which no instruction can be completed

!  For a program with N instructions and S stall cycles, Average CPI = (N + S)/N (worked example below)

!  S depends on
   "  frequency of RAW dependences
   "  exact distance between the dependent instructions
   "  distance between dependencies: suppose i1, i2, and i3 all depend on i0; once i1's dependence is resolved, i2 and i3 must be okay too

47
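A quick worked example with assumed numbers (not from the slides):

```latex
% Assume N = 1000 instructions and 20% of them incur one stall cycle each,
% so S = 200 stall cycles.
\[
\text{Average CPI} = \frac{N + S}{N} = \frac{1000 + 200}{1000} = 1.2
\qquad\Rightarrow\qquad
\text{IPC} = \frac{1}{1.2} \approx 0.83
\]
```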

Reducing Stalls with Data Forwarding

!  Also called Data Bypassing

!  Forward the value to the dependent instruction as soon as it is available

48

[Pipeline diagram: SUB X2,X1,X3 followed by AND X12,X2,X5; the SUB result is forwarded from the end of its EX stage directly to the EX stage of AND]

Resolving Data Dependence

!  Option 2: data forwarding / bypassing

!  Instructions IA and IB (where IA comes before IB) have a flow dependency, i.e., if IB in the ID stage reads a register written by IA in the EX or MEM stage, then the operand required by IB is not yet in the RF
   ⇒ retrieve the operand from the datapath instead of the RF

49

Double Data Dependency

!  Consider the sequence:
   ADD X1, X1, X2
   AND X1, X1, X3
   SUB X1, X1, X4

!  Retrieve the operand from the younger (most recent) instruction if data dependencies occur with multiple outstanding instructions

[Pipeline diagram: SUB X1,X1,X4 must receive X1 forwarded from AND X1,X1,X3 (correct forwarding), not from ADD X1,X1,X2 (incorrect forwarding)]

Datapath with Forwarding

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Forwarding Paths

[Figure: forwarding paths for dist(i,j) = 1, dist(i,j) = 2, and dist(i,j) = 3]

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Forwarding Conditions and Logic

Mux control   | Source  | Explanation
ForwardA = 00 | ID/EX   | First operand comes from the register file.
ForwardA = 10 | EX/MEM  | First operand is forwarded from the ALU result of the previous instruction.
ForwardA = 01 | MEM/WB  | First operand is forwarded from data memory or an earlier ALU result.

if (Op1EX != X31) && (Op1EX == destMEM) && RegWriteMEM
    then forward operand from the EX/MEM stage    // dist = 1
else if (Op1EX != X31) && (Op1EX == destWB) && RegWriteWB
    then forward operand from the MEM/WB stage    // dist = 2
else
    use the operand from the register file        // dist >= 3

Ordering matters!! Must check the youngest match first (see the C sketch below)

What does the above not take into account?

Similar for ForwardB
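Here is the same selection written out as a small C function, with hypothetical struct/field names (ForwardB is symmetric for the second operand):

```c
#include <stdbool.h>
#include <stdint.h>

#define XZR 31

/* Mux encodings from the table above: 00 = register file,
 * 10 = EX/MEM ALU result, 01 = MEM/WB value. */
enum forward_sel { FWD_RF = 0, FWD_MEM_WB = 1, FWD_EX_MEM = 2 };

typedef struct { uint32_t dest; bool reg_write; } latch_info_t;

/* op1_ex: first source register of the instruction currently in EX. */
enum forward_sel forward_a(uint32_t op1_ex,
                           const latch_info_t *ex_mem,
                           const latch_info_t *mem_wb)
{
    if (op1_ex == XZR)
        return FWD_RF;
    /* Ordering matters: check the youngest producer (EX/MEM) first. */
    if (ex_mem->reg_write && ex_mem->dest == op1_ex)
        return FWD_EX_MEM;                  /* dist = 1 */
    if (mem_wb->reg_write && mem_wb->dest == op1_ex)
        return FWD_MEM_WB;                  /* dist = 2 */
    return FWD_RF;                          /* dist >= 3: value already in the RF */
}
```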

Data Forwarding (Dependency Analysis)

!  Even with data forwarding, a RAW dependence on an immediately preceding LDUR instruction cannot be satisfied in time; this is called a load-use dependency

!  Requires a stall

54

Stage | R/I-Type     | LDUR    | STUR         | B
IF    |              |         |              |
ID    |              |         |              |
EX    | use, produce | use     | (use)        |
MEM   |              | produce | actually use |
WB    |              |         |              |

Load-Use Data Dependency Example

!  Consider this sequence:
   LDUR X2, [X1, #20]
   AND X4, X2, X5
   OR X8, X2, X6

[Pipeline diagram: LDUR X2,[X1,#20] produces X2 only at the end of its MEM stage, but AND X4,X2,X5 needs it at the start of its EX stage one cycle earlier, so forwarding alone cannot help (incorrect forwarding)]

Load-Use Data Dependency Example

Bubble inserted here

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Load-Use Dependency Detection

!  Load-use dependency when (see the C sketch below)
   "  (OpcodeEX == LDUR) and
   "  ((destEX == Op1ID) or (destEX == Op2ID)) and (destEX != X31)

!  If detected, stall and insert bubble

!  What if the instruction following LDUR is STUR?
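A minimal C sketch of this detection, using a decoded-instruction struct similar to the earlier sketch (the opcode enum is illustrative, not the simulator's actual encoding):

```c
#include <stdbool.h>
#include <stdint.h>

#define XZR 31

enum opcode { OP_RTYPE, OP_LDUR, OP_STUR, OP_B /* ... */ };

typedef struct {
    enum opcode op;
    uint32_t dest;          /* destination register */
    uint32_t op1, op2;      /* source register fields */
    bool use_op1, use_op2;  /* does the instruction actually read them? */
} decoded_inst_t;

/* True if the instruction in ID needs the register that the LDUR in EX is
 * about to load -> stall one cycle and insert a bubble. */
bool load_use_hazard(const decoded_inst_t *id, const decoded_inst_t *ex)
{
    if (ex->op != OP_LDUR || ex->dest == XZR)
        return false;
    return (id->use_op1 && id->op1 == ex->dest) ||
           (id->use_op2 && id->op2 == ex->dest);
}
```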

Pipeline with Data Dependency Handling

[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]

Load-use dependency detection

More Optimization Opportunities

59

!  Consider this sequence (requires 1 stall):
   LDUR X2, [X1, #20]
   AND X4, X2, X5
   OR X8, X3, X6

!  Why not re-order the instructions, as long as program semantics are preserved? Then there is no need to stall:
   LDUR X2, [X1, #20]
   OR X8, X3, X6
   AND X4, X2, X5

HW vs. SW in Dependency Handling

!  Software-based vs. hardware-based interlocking
   "  Who inserts/manages the pipeline bubbles?
   "  Who finds the independent instructions and re-orders instructions to fill "empty" pipeline slots?

!  What are the advantages/disadvantages of each?

60

HW vs. SW in Dependency Handling

!  Software-based scheduling of instructions → static scheduling
   "  The compiler orders the instructions; the hardware executes them in that order
   "  Contrast this with dynamic scheduling (in which hardware can execute instructions out of the compiler-specified order)
   "  How does the compiler know the latency of each instruction?

!  What information does the compiler not know that makes static scheduling difficult?
   "  Answer: Anything that is determined at run time
      !  Variable-length operation latency, branch direction

!  How can the compiler alleviate this (i.e., estimate the unknown)?
   "  Answer: Profiling

61
