CMSC 22200 Computer Architecture
Lecture 5: Pipelining: Data Dependency Handling
Prof. Yanjing Li University of Chicago
Administrative Stuff
! Lab1
" Aim to finish grading by Thursday
! Lab2: due next Thursday (10/20)
" You should have started by now
" Fix Lab1 first! Use the "golden" simulator
! And get help!
! Lab2 and Lab3 late penalty waived
" Still need to hand in within 48 hours past due date
! Need to reschedule lecture for next Tuesday (10/18)
Lecture Outline
! Pipelining discussions
! Non-ideal pipeline
! Dependencies and how to handle them
Single Cycle uarch: Datapath & Control
[Figure: single-cycle datapath and control. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Putting It All Together: R-Type ALU
[Figure: single-cycle datapath with control signal values for an R-type ALU instruction (ALUop = 10). Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Putting It All Together: LDUR
[Figure: single-cycle datapath with control signal values for LDUR; the ALU performs ADD for address calculation. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Putting It All Together: STUR
[Figure: single-cycle datapath with control signal values for STUR; the ALU performs ADD for address calculation. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Putting It All Together: B
[Figure: single-cycle datapath with control signal values for B; the ALU result is a don't-care (X). Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Putting It All Together: CBZ (Taken)
[Figure: single-cycle datapath with control signal values for CBZ (taken); the ALU performs a compare. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Putting It All Together: CBZ (Not Taken)
[Figure: single-cycle datapath with control signal values for CBZ (not taken); the ALU performs a compare. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Single Cycle uArch: Summary
! Inefficient
" All instructions run as slow as the slowest instruction
! Not necessarily the simplest way to implement an ISA
" Single-cycle implementation of REP MOVS (x86)?
! Not easy to optimize/improve performance
" Optimizing the common case (e.g., common instructions) does not work
" Need to optimize the worst case all the time
! All resources are not fully utilized
" e.g., data memory access can't overlap with ALU operation
! How to do better?
Single-Cycle, Multi-Cycle, Pipelining
! Single-cycle: 1 cycle per instruction, long cycle time
! Multi-cycle: 5 cycles per instruction, short cycle time
! Pipeline: 1 cycle per instruction (steady state), short cycle time
[Figure: instruction timing diagrams (F D E M W per instruction) over time for the single-cycle, multi-cycle, and pipelined microarchitectures.]
Adding Pipeline Registers
! Registers between stages to hold information produced in the previous cycle
[Figure: pipelined datapath with pipeline registers IR_D, PC_D, PC_E, nPC_M, A_E, B_E, Imm_E, Aout_M, B_M, MDR_W, Aout_W. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Corrected from last lecture: latching PC instead of PC+4
Pipelined Control
Identical set of control points as the single-cycle uarch!!
[Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Pipelined uarch
[Figure: five-stage pipelined datapath. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?
Performance Analysis
Terminologies and Definitions
! CPI: cycles per instruction
! IPC: instructions per cycle, which is 1/CPI
! Execution time of an instruction
" {CPI} x {clock cycle time}
! Execution time of a program
" Sum over all instructions [ {CPI} x {clock cycle time} ]
" {# of instructions} x {average CPI} x {clock cycle time}
Examples
! Remember: execution time of a program
" Sum over all instructions [ {CPI} x {clock cycle time} ]
" {# of instructions} x {average CPI} x {clock cycle time}
! Single-cycle uarch
" CPI = 1, but clock cycle time is long
! Multi-cycle uarch (with 5 stages)
" CPI = 5, but clock cycle time is short
! Pipelined uarch (with 5 stages)
" CPI = 1 (steady state), clock cycle time same as multi-cycle
" This is the ideal case
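The formulas above can be sketched numerically. This is a minimal sketch; the instruction count and cycle times (800 ps single-cycle, 200 ps multi-cycle/pipelined) are illustrative assumptions, not numbers from the lecture.

```python
def exec_time_ps(n_insts, avg_cpi, cycle_time_ps):
    # Execution time = {# of instructions} x {average CPI} x {clock cycle time}
    return n_insts * avg_cpi * cycle_time_ps

N = 1_000_000
single = exec_time_ps(N, 1, 800)  # CPI = 1, long 800 ps cycle
multi  = exec_time_ps(N, 5, 200)  # CPI = 5, short 200 ps cycle
pipe   = exec_time_ps(N, 1, 200)  # CPI = 1 (steady state), 200 ps cycle
print(single, multi, pipe)  # 800000000 1000000000 200000000
```

With these numbers the pipelined uarch is 4x faster than single-cycle and 5x faster than multi-cycle, which is exactly the "ideal case" the slide describes.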
Pipelining: Discussions
Pipelined uarch
[Figure: five-stage pipelined datapath. Based on original figure from P&H CO&D, COPYRIGHT 2017 Elsevier. ALL RIGHTS RESERVED.]
Is this a good partitioning? Why not 4 or 6 stages? Why not different boundaries?
Pipeline Considerations
! How to partition?
! How many stages?
Pipeline Partitioning: Resource Requirement
! The goal: no shared resources among different pipeline stages
" i.e., no resource is used by more than 1 stage
" Otherwise, we have resource contention or a structural hazard
! Example: need to be able to fetch instructions (in IF stage) and load data (in MEM stage) at the same time
" A single memory interface is not sufficient
" Solution 1: provide two separate interfaces via instruction and data caches
" Solution 2: ??
How Many Pipeline Stages?
! BW (bandwidth), a.k.a. throughput: 1/cycle time
! Ideally, sequential elements (pipeline registers) do not impose additional delays/cost
[Figure: combinational logic (F,D,E,M,W) with delay T ps gives BW ≈ 1/T; split into 2 stages of T/2 ps (F,D,E | M,W) gives BW ≈ 2/T; split into 3 stages of T/3 ps (F,D | E,M | M,W) gives BW ≈ 3/T.]

How Many Pipeline Stages?
! Nonpipelined version with delay T
" BW = 1/(T + S), where S = sequential element delay
! k-stage pipelined version
" BW_k-stage = 1/(T/k + S)
" BW_max = 1/(1 gate delay + S)
! Sequential element delay reduces BW (switching overhead between stages)
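The bandwidth formulas can be sketched numerically, as a minimal sketch assuming illustrative values of T and S (not from the lecture):

```python
def bw(T, S, k=1):
    # k-stage pipeline: per-stage logic delay T/k plus register overhead S,
    # i.e., BW_k-stage = 1 / (T/k + S); k = 1 gives the nonpipelined case
    return 1.0 / (T / k + S)

T, S = 800.0, 50.0  # assumed: 800 ps of combinational logic, 50 ps per register
for k in (1, 2, 4, 8):
    print(k, round(bw(T, S, k), 6))
```

The printed throughputs grow sub-linearly in k: the overhead S is paid in every stage, which is exactly why sequential element delay reduces BW.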
How Many Pipeline Stages?
! Nonpipelined version with combinational cost G
" Cost = G + L, where L = sequential element cost
! k-stage pipelined version
" Cost_k-stage ≈ G + L*k
[Figure: G gates unpipelined vs. k segments of G/k gates each, with a pipeline register after each segment.]
! Sequential elements increase hardware cost
It is critical to balance the tradeoffs, i.e., how many stages and what is done in each stage
Properties of An Ideal Pipeline
! Goal: increase throughput with little increase in cost (hardware cost, in the case of instruction processing)
! Repetition of identical operations
" The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
! Repetition of independent operations
" No dependencies between repeated operations
! Uniformly partitionable suboperations
" Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
! Can you implement an ideal pipeline for instruction processing?
Instruction Pipeline: Not Ideal
! Identical operations ... NOT!
⇒ different instructions ⇒ not all need the same stages
Forcing different instructions to go through the same pipe stages ⇒ external fragmentation (some pipe stages idle for some instructions)
! Uniform suboperations ... NOT!
⇒ different pipeline stages ⇒ not the same latency
Need to force each stage to be controlled by the same clock ⇒ internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
! Independent operations ... NOT!
⇒ instructions are not independent of each other
Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results ⇒ pipeline stalls (pipeline is not always moving)
Instruction Pipeline: Not Ideal
! Identical operations ... NOT!
⇒ different instructions ⇒ not all need the same stages
Forcing different instructions to go through the same pipe stages ⇒ external fragmentation (some pipe stages idle for some instructions)
! Examples
" Add, Branch: no need to go through the MEM stage
" Others?
! Performance impact?
Instruction Pipeline: Not Ideal
! Uniform suboperations ... NOT!
⇒ different pipeline stages ⇒ not the same latency
Need to force each stage to be controlled by the same clock ⇒ internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
Non-Uniform Operations: Laundry Analogy
The slowest step decides throughput
[Figure: laundry pipeline analogy with steps of different latencies. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Non-Uniform Operations: Example
[Figure: stage latencies IF 200 ps, ID 100 ps, EX 200 ps, MEM 200 ps, WB 100 ps. Single-cycle: each instruction takes 800 ps. Pipelined: every stage is forced to the slowest stage's latency, so each cycle takes 200 ps. Based on original figure from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Instruction Pipeline: Not Ideal
! Independent operations ... NOT!
⇒ instructions are not independent of each other
Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results ⇒ pipeline stalls (pipeline is not always moving)
Dependencies and Their Types
! Also called "hazards"
! Two types
" Data dependency
" Control dependency
Data Dependency Handling
Data Dependency Types
Flow dependency (Read-after-Write, RAW):
r3 ← r1 op r2
r5 ← r3 op r4

Anti dependency (Write-after-Read, WAR):
r5 ← r3 op r4
r3 ← r6 op r7

Output dependency (Write-after-Write, WAW):
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
Data Dependency Types
! Flow dependencies always need to be obeyed because they constitute true dependence on a value
! Anti and output dependencies exist due to the limited number of architectural registers
" They are dependence on a name, not a value
! Anti and output dependences are easier to handle
" Write to the destination in one stage and in program order
! Flow dependences are more interesting
Ways of Handling Flow Dependencies
! Detect and wait until value is available in register file
! Detect and forward/bypass data to dependent instruction
! Detect and eliminate the dependence at the software level
" No need for the hardware to detect dependence
! Predict the needed value(s), execute "speculatively", and verify
! Do something else (fine-grained multithreading)
" No need to detect
Flow Dependency Example
! Consider this sequence:
SUB  X2, X1, X3
AND  X12, X2, X5
OR   X13, X2, X6
ADD  X14, X2, X2
STUR X15, [X2, #100]
! SUB writes X2 and ADD reads it in the same cycle
! Assume "internal forwarding" in the register file
" i.e., ADD gets the new X2 value produced by SUB
Flow Dependency Example
[Figure: pipeline timing diagram over time. SUB X2,X1,X3 is followed by AND X12,X2,X5, OR X13,X2,X6, ADD X14,X2,X2, and STUR X15,[X2,#100]. AND and OR reach the ID stage before SUB writes X2 back in WB; do they read the correct value?]
How to Detect Flow Dependencies in HW?
! Instructions IA and IB (where IA comes before IB) have a RAW dependency iff
" IB (R/I, LDUR, or STUR) reads a register written by IA (R/I or LDUR), and
" dist(IA, IB) < dist(ID, WB) = 3
       R/I-Type   LDUR       STUR      B
IF
ID     read RF    read RF    read RF
EX
MEM
WB     write RF   write RF
Flow Dependency Check Logic
! Helper functions
" Op1(I) and Op2(I) return the 1st and 2nd register operand field of I, respectively
" Use_Op1(I) returns true if I requires the 1st register operand and the register is not X31; similarly for Use_Op2(I)
! Flow dependency occurs when
" (Op1(IR_ID) == dest_EX) && Use_Op1(IR_ID) && RegWrite_EX, or
" (Op1(IR_ID) == dest_MEM) && Use_Op1(IR_ID) && RegWrite_MEM, or
" (Op2(IR_ID) == dest_EX) && Use_Op2(IR_ID) && RegWrite_EX, or
" (Op2(IR_ID) == dest_MEM) && Use_Op2(IR_ID) && RegWrite_MEM
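The check logic above can be sketched in software. This is a minimal sketch: the dict-based instruction representation (fields op1/op2/use_op1/use_op2) is my own assumption, but the condition mirrors the slide's logic.

```python
X31 = 31  # XZR: reads of register 31 never create a dependency

def flow_dependency(ir_id, dest_ex, regwrite_ex, dest_mem, regwrite_mem):
    # ir_id: instruction in the ID stage (assumed dict representation)
    for reg, used in ((ir_id["op1"], ir_id["use_op1"]),
                      (ir_id["op2"], ir_id["use_op2"])):
        if used and reg != X31:
            # match against the destination of the instruction in EX or MEM
            if (regwrite_ex and reg == dest_ex) or \
               (regwrite_mem and reg == dest_mem):
                return True
    return False

# AND X12,X2,X5 in ID while SUB X2,X1,X3 is in EX -> dependency detected
and_inst = {"op1": 2, "op2": 5, "use_op1": True, "use_op2": True}
print(flow_dependency(and_inst, dest_ex=2, regwrite_ex=True,
                      dest_mem=0, regwrite_mem=False))  # True
```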
Resolving Data Dependence
! Option 1: Stall the pipeline (i.e., inserting "bubbles")
! Stall = make the dependent instruction wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
[Figure: pipeline diagrams for producer i: rx ← _ followed by consumer j: _ ← rx, for dist(i,j) = 1, 2, and 3. The smaller the distance, the more bubbles must be inserted before j's ID stage can line up with i's WB stage.]
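The dist(i,j) cases above follow a simple rule, assuming internal forwarding in the register file (a write in WB is visible to a read in ID in the same cycle, so dist(ID, WB) = 3). A minimal sketch:

```python
def bubbles_needed(dist):
    # The consumer's ID must be >= 3 cycles after the producer's ID
    # (dist(ID, WB) = 3); closer than that, insert 3 - dist bubbles.
    return max(0, 3 - dist)

for d in (1, 2, 3, 4):
    print(d, bubbles_needed(d))  # 1->2, 2->1, 3->0, 4->0
```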
Resolving Data Dependence
! Option 1: Stall the pipeline (i.e., inserting "bubbles")
[Figure: pipeline diagram with dist(i,j) = 1. Two bubbles (nops) are inserted between Inst_i and Inst_j; Inst_j repeats its IF/ID stages until Inst_i's result is written back.]
How to Stall?
! Prevent update of PC and IF/ID registers ⇒ ensure stalled instructions stay in their stages
" Use a "write enable" signal for the registers
! Force control values in ID/EX registers to 0 ⇒ nop (no-operation)
! It is crucial that the EX, MEM and WB stages continue to advance normally during stall cycles
[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
Register with Write Control
! Only updates on clock edge when write control input is 1
! Used when stored value is required later
[Figure: register with D and Clk inputs, Write enable, and Q output. Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
Impact of Stall on Performance
! Each stall cycle corresponds to one lost cycle in which no instruction can be completed
! For a program with N instructions and S stall cycles,
Average CPI = (N + S) / N
! S depends on
" frequency of RAW dependences
" exact distance between the dependent instructions
" distance between dependencies
suppose i1, i2 and i3 all depend on i0; once i1's dependence is resolved, i2 and i3 must be okay too
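The average-CPI formula is a one-liner; the program size and stall count below are made-up illustrative numbers.

```python
def avg_cpi(n_insts, stall_cycles):
    # Average CPI = (N + S) / N: each stall cycle completes no instruction
    return (n_insts + stall_cycles) / n_insts

print(avg_cpi(1_000_000, 200_000))  # 1.2
```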
Reducing Stalls with Data Forwarding
! Also called Data Bypassing
! Forward the value to the dependent instruction as soon as it is available
[Figure: SUB X2,X1,X3 forwards its result directly to the EX stage of AND X12,X2,X5 instead of waiting for WB.]

Resolving Data Dependence
! Option 2: data forwarding / bypassing
! Instructions IA and IB (where IA comes before IB) have a flow dependency
" i.e., if IB in the ID stage reads a register written by IA in the EX or MEM stage, then the operand required by IB is not yet in the RF
⇒ retrieve operand from datapath instead of the RF
Double Data Dependency
! Consider the sequence:
ADD X1, X1, X2
AND X1, X1, X3
SUB X1, X1, X4
! Retrieve the operand from the younger (most recent) producer when data dependencies occur in multiple outstanding instructions
[Figure: SUB's X1 must be forwarded from AND (the EX/MEM latch), not from ADD (the MEM/WB latch); forwarding from ADD would be incorrect.]
Datapath with Forwarding
[Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
Forwarding Paths
[Figure: forwarding paths for dist(i,j) = 1, dist(i,j) = 2, and dist(i,j) = 3. Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
Forwarding Conditions and Logic

Mux control    Source    Explanation
ForwardA = 00  ID/EX     First operand comes from register file.
ForwardA = 10  EX/MEM    First operand is forwarded from ALU result of the previous instruction.
ForwardA = 01  MEM/WB    First operand is forwarded from data memory or an earlier ALU result.

if (Op1_EX != X31) && (Op1_EX == dest_MEM) && RegWrite_MEM
  then forward operand from EX/MEM stage   // dist = 1
else if (Op1_EX != X31) && (Op1_EX == dest_WB) && RegWrite_WB
  then forward operand from MEM/WB stage   // dist = 2
else use operand from register file        // dist >= 3

Ordering matters!! Must check youngest match first
What does the above not take into account?
Similar for ForwardB
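The ForwardA priority logic above can be sketched as follows. This is a minimal sketch (flat function arguments rather than real pipeline latches); note that the EX/MEM (youngest) match is checked before MEM/WB, exactly as the slide requires.

```python
X31 = 31  # XZR never needs forwarding

def forward_a(op1_ex, dest_mem, regwrite_mem, dest_wb, regwrite_wb):
    # Check the youngest producer first: EX/MEM latch (dist = 1)
    if op1_ex != X31 and regwrite_mem and op1_ex == dest_mem:
        return 0b10  # forward from EX/MEM
    # Then the older producer: MEM/WB latch (dist = 2)
    if op1_ex != X31 and regwrite_wb and op1_ex == dest_wb:
        return 0b01  # forward from MEM/WB
    return 0b00      # operand comes from the register file (dist >= 3)

# ADD X1..; AND X1..; SUB X1,X1,X4: SUB must take X1 from AND (EX/MEM),
# even though ADD (MEM/WB) also wrote X1 (the double-dependency case above)
print(forward_a(1, dest_mem=1, regwrite_mem=True,
                dest_wb=1, regwrite_wb=True))  # 2 (i.e., 0b10)
```

What this does not take into account, per the slide's question, is a load in EX: LDUR's value is not available until MEM, which is handled by the load-use stall below.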
Data Forwarding (Dependency Analysis)
! Even with data forwarding, a RAW dependence on an immediately preceding LDUR instruction requires a stall; this is called a load-use dependency
       R/I-Type       LDUR       STUR           B
IF
ID
EX     use, produce   use        use            (use)
MEM                   produce    actually use
WB
Load-Use Data Dependency Example
! Consider this sequence:
LDUR X2, [X1, #20]
AND  X4, X2, X5
OR   X8, X2, X6
[Figure: LDUR produces X2 only at the end of MEM, but AND needs it at the start of EX one cycle earlier, so forwarding alone gives an incorrect value. A bubble is inserted so that AND's EX stage lines up after LDUR's MEM stage. Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
Load-Use Dependency Detection
! Load-use dependency when
" (Opcode_EX == LDUR) and
" ((dest_EX == Op1_ID) or (dest_EX == Op2_ID)) and
" (dest_EX != X31)
! If detected, stall and insert a bubble
! What if the instruction following LDUR is STUR?
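The detection condition above is a direct boolean check. A minimal sketch, with the flat argument representation being my own assumption:

```python
X31 = 31  # a load into XZR is discarded, so it creates no hazard

def load_use_hazard(opcode_ex, dest_ex, op1_id, op2_id):
    # Stall iff a LDUR in EX writes a register that the instruction
    # currently in ID reads as either operand
    return (opcode_ex == "LDUR" and dest_ex != X31
            and (dest_ex == op1_id or dest_ex == op2_id))

# LDUR X2,[X1,#20] in EX while AND X4,X2,X5 is in ID -> stall one cycle
print(load_use_hazard("LDUR", dest_ex=2, op1_id=2, op2_id=5))  # True
```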
Pipeline with Data Dependency Handling
[Figure: pipelined datapath with forwarding unit and load-use dependency detection. Based on original figure from P&H CO&D, COPYRIGHT 2016 Elsevier. ALL RIGHTS RESERVED.]
More Optimization Opportunities
! Consider this sequence; it requires 1 stall:
LDUR X2, [X1, #20]
AND  X4, X2, X5
OR   X8, X3, X6
! Why not re-order the instructions, as long as program semantics are preserved? Then there is no need to stall:
LDUR X2, [X1, #20]
OR   X8, X3, X6
AND  X4, X2, X5
HW vs. SW in Dependency Handling
! Software-based vs. hardware-based interlocking
" Who inserts/manages the pipeline bubbles?
" Who finds the independent instructions and re-orders instructions to fill "empty" pipeline slots?
! What are the advantages/disadvantages of each?
HW vs. SW in Dependency Handling
! Software-based scheduling of instructions ⇒ static scheduling
" Compiler orders the instructions, hardware executes them in that order
" Contrast this with dynamic scheduling (in which hardware can execute instructions out of the compiler-specified order)
" How does the compiler know the latency of each instruction?
! What information does the compiler not know that makes static scheduling difficult?
" Answer: Anything that is determined at run time
" Variable-length operation latency, branch direction
! How can the compiler alleviate this (i.e., estimate the unknown)?
" Answer: Profiling