Pipelining (Chapter 8), TU-Delft TI1400/12-PDS
Slide 1: Pipelining (Chapter 8)
TU-Delft TI1400/12-PDS
Slides: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_8.ppt
Course website: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_results.htm
Slide 2: Basic idea (1)
[Figure: sequential execution along the time axis: instructions I1..I4, each fetch Fi immediately followed by its execution Ei, one instruction after the other.]
[Figure: hardware sketch: an instruction fetch unit connected through an intermediate buffer B1 to an execution unit.]
Slide 3: Basic idea (2): Overlap
[Figure: pipelined execution over clock cycles 1-5: while Ei of instruction Ii executes, Fi+1 of the next instruction is fetched, so F and E of successive instructions I1..I4 overlap.]
Slide 4: Instruction phases
• F  Fetch instruction
• D  Decode instruction and fetch operands
• O  Perform operation
• W  Write result
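The overlap of these phases can be made concrete with a small timing sketch. This is a hypothetical illustration, not from the slides: in an ideal k-stage pipeline, instruction i (0-based) occupies stage s during clock cycle i + s + 1.

```python
# Hypothetical sketch (not from the slides): ideal k-stage pipeline timing.
# Instruction i (0-based) is in stage s during clock cycle i + s + 1.

STAGES = ["F", "D", "O", "W"]

def schedule(n_instructions, stages=STAGES):
    """Return {cycle: [(instruction, stage), ...]} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        for s, name in enumerate(stages):
            table.setdefault(i + s + 1, []).append((f"I{i+1}", name))
    return table

def total_cycles(n_instructions, n_stages=len(STAGES)):
    # The first instruction needs n_stages cycles; each later one adds 1.
    return n_stages + n_instructions - 1

# 4 instructions in a 4-stage pipeline finish after 4 + 4 - 1 = 7 cycles,
# versus 4 * 4 = 16 cycles when executed strictly sequentially.
```

Note how in cycle 4 all four stages are busy at once (W1, O2, D3, F4), which is exactly the steady state the slides describe.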
Slide 5: Four-stage pipeline
[Figure: four-stage pipelined execution over clock cycles 1-7: I1 runs F1 D1 O1 W1 in cycles 1-4; each subsequent instruction starts one cycle later, so I4 runs F4 D4 O4 W4 in cycles 4-7.]
Slide 6: Hardware organization (1)
[Figure: Fetch unit → buffer B1 → Decode-and-fetch-operands unit → buffer B2 → Operation unit → buffer B3 → Write unit.]
Slide 7: Hardware organization (2)
During cycle 4, the buffers contain:
• B1: instruction I3
• B2: the source operands of I2, the specification of the operation, and the specification of the destination operand
• B3: the result of the operation of I1 and the specification of the destination operand
Slide 8: Hardware organization (3)
[Figure: the pipeline of slide 6, annotated with the cycle-4 buffer contents: B1 holds I3, B2 holds the operands and operation of I2, B3 holds the result of I1.]
Slide 9: Pipeline stall (1)
• Pipeline stall: delay in a stage of the pipeline due to an instruction
• Reasons for a pipeline stall:
  - cache miss
  - long operation (for example, division)
  - dependency between successive instructions
  - branching
Slide 10: Pipeline stall (2): Cache miss
[Figure: pipelined execution of I1-I3 over clock cycles 1-8; a cache miss in I2 stretches its fetch over several cycles, so D2, O2, W2 and everything behind them shift to later cycles.]
Slide 11: Pipeline stall (3): Cache miss
[Figure: the same stall seen per stage over clock cycles 1-8: while F2 occupies the fetch stage for several cycles, the decode, operation, and write stages sit idle before D1/D2/D3, O1/O2/O3, and W1/W2/W3 can proceed. Caption: effect of cache miss in F2.]
Slide 12: Pipeline stall (4): Long operation
[Figure: pipelined execution of I1-I4 over clock cycles 1-8; one instruction's operation stage (for example, a division) takes multiple cycles, delaying its write stage and the stages of all later instructions.]
Slide 13: Pipeline stall (5): Dependencies
• The instructions
    ADD R1, 3(R1)
    ADD R4, 4(R1)
  cannot be done in parallel (the second reads R1, which the first modifies)
• The instructions
    ADD R2, 3(R1)
    ADD R4, 4(R3)
  can be done in parallel
Slide 14: Pipeline stall (6): Branch
[Figure: instruction Ii is a branch; the fetch Fk of the branch target Ik can only start after the branch has been executed (Ei), so the pipeline stalls. Caption: pipeline stall due to branch; only start fetching instructions after the branch has been executed.]
Slide 15: Data dependency (1): example
MUL R2,R3,R4   /* R4 destination */
ADD R5,R4,R6   /* R6 destination */
The new value of R4 must be available before the ADD instruction uses it.
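This kind of read-after-write (RAW) dependency can be detected mechanically. A hypothetical sketch, not from the slides; the (op, src1, src2, dest) tuple format is an assumption for illustration, matching the destination-last convention of the example above:

```python
# Hypothetical sketch (not from the slides): detecting the read-after-write
# (RAW) hazard of the MUL/ADD example. The (op, src1, src2, dest) tuple
# format is an assumption for illustration.

def raw_hazard(first, second):
    """True if `second` reads the register that `first` writes."""
    _, _, _, dest = first
    _, src1, src2, _ = second
    return dest in (src1, src2)

mul = ("MUL", "R2", "R3", "R4")   # writes R4
add = ("ADD", "R5", "R4", "R6")   # reads R4, so it must wait for MUL
```

A pipeline (or a compiler scheduler) would stall or reorder exactly when `raw_hazard` is true for adjacent instructions.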
Slide 16: Data dependency (2): example
[Figure: the MUL (I1) and ADD (I2) of slide 15 in the four-stage pipeline; D2 must wait for W1, so I2, I3, and I4 are delayed. Caption: pipeline stall due to the data dependency between W1 and D2.]
Slide 17: Branching: Instruction queue
[Figure: Fetch unit → instruction queue → Dispatch → Operation → Write.]
Slide 18: Idling at branch
[Figure: Ij is a branch; Ij+1 is fetched (Fj+1) but discarded; the pipeline idles until the branch target Ik is fetched (Fk) and executed (Ek), followed by Ik+1.]
Slide 19: Branch with instruction queue
[Figure: instructions I1-I4 and the branch-target stream Ij-Ij+3; the instruction queue lets execution continue while the branch is handled; the pre-fetched I4 is discarded. Branch folding: execute a later branch instruction simultaneously (i.e., compute its target).]
Slide 20: Delayed branch (1): reordering
Original:
    LOOP  Shift_left R1
          Decrement R2
          Branch_if>0 LOOP
    NEXT  Add R1,R3

Reordered:
    LOOP  Decrement R2
          Branch_if>0 LOOP
          Shift_left R1
    NEXT  Add R1,R3
In the reordered version the Shift_left sits in the branch delay slot and is always executed; the original always loses a cycle on the branch.
Slide 21: Delayed branch (2): execution timing
[Figure: F/E timing of the reordered loop: Decrement, Branch, Shift, Decrement, Branch, Shift, Add; the Shift in the delay slot overlaps the branch, so no cycle is lost.]
Slide 22: Branch prediction (1)
[Figure: I1 (Compare) and I2 (Branch-if>) are followed by speculatively fetched I3 and I4; when the prediction turns out to be wrong, their stages are squashed (marked X) and the correct target Ik is fetched (Fk Dk). Caption: effect of incorrect branch prediction.]
Slide 23: Branch prediction (2)
Possible implementation:
- use a single bit
- the bit records the previous outcome of the branch
- the bit tells from which location to fetch the next instructions
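The single-bit scheme can be sketched in a few lines. This is a hypothetical illustration, not from the slides; the per-branch table indexed by instruction address and the initial not-taken assumption are my own choices:

```python
# Hypothetical sketch (not from the slides): a 1-bit branch predictor.
# One bit per branch records whether it was taken last time; the next
# prediction simply repeats that outcome.

class OneBitPredictor:
    def __init__(self):
        self.taken = {}  # branch address -> last outcome (the single bit)

    def predict(self, pc):
        return self.taken.get(pc, False)  # assume not-taken the first time

    def update(self, pc, outcome):
        self.taken[pc] = outcome

# A loop branch taken 3 times and then not taken mispredicts twice with
# a 1-bit scheme: once on the first iteration and once on loop exit.
p = OneBitPredictor()
mispredicts = 0
for outcome in [True, True, True, False]:
    if p.predict(0x40) != outcome:
        mispredicts += 1
    p.update(0x40, outcome)
```

The two guaranteed mispredictions per loop are the classic weakness that two-bit predictors address.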
Slide 24: Data paths of CPU (1)
[Figure: register file with Source 1 and Source 2 read ports feeding the SRC1 and SRC2 ALU input registers; the ALU result goes to RSLT and back to the register-file destination; operand-forwarding paths bypass the register file.]
Slide 25: Data paths of CPU (2)
[Figure: the Operation and Write stages with the SRC1, SRC2, and RSLT registers around the ALU and register file; a forwarding data path routes RSLT directly back to the ALU inputs.]
Slide 26: Pipelined operation
I1: Add R1, R2, R3
I2: Shift_left R3
[Figure: I1-I4 in the four-stage pipeline; I2's shift of R3 needs the result of the Add (R1 + R2 into R3) before its operation stage, so the result of the Add has to be available in time.]
Slide 27: Short pipeline
[Figure: with forwarding in a short pipeline, I1 computes R1 + R2 into R3, and I2 receives the result through the forwarding path in its operation stage (fwd, shift R3); I3 proceeds normally (F D O W).]
Slide 28: Long pipeline
[Figure: when the operation stage takes three cycles (O1 O2 O3), the forwarded result reaches the next instruction later, so even with forwarding (fwd) the following instructions must wait before their operation stages can start.]
Slide 29: Compiler solution
I1: Add R1, R2, R3
I2: Shift_left R3
becomes
I1: Add R1, R2, R3
    NOP
    NOP
I2: Shift_left R3
The compiler inserts no-operations to wait for the result.
Slide 30: Side effects
I2: ADD  D1, D2
I3: ADDX D3, D4   (uses the carry produced by I2)
Other form of (implicit) data dependency: instructions can have side effects (here, the carry flag) that are used by the next instruction.
Slide 31: Complex addressing mode
Load (X(R1)), R2
[Figure: the decode stage must compute X+[R1] (X comes from the instruction), fetch [X+[R1]], and dereference again to get [[X+[R1]]] before R2 can be written; the extra memory accesses stretch the decode stage over several cycles and stall the next instruction, causing a pipeline stall.]
Slide 32: Simple addressing modes
Add #X,R1,R2
Load (R2),R2
Load (R2),R2
[Figure: the same effect built up from simple instructions: the Add computes X+[R1] into R2, the first Load fetches [X+[R1]], and the second Load fetches [[X+[R1]]]; with forwarding (fwd) this takes the same amount of time as the complex addressing mode.]
Slide 33: Addressing modes
• Requirements for addressing modes with pipelining:
  - operand access takes no more than one memory access
  - only load and store instructions access memory
  - addressing modes do not have side effects
• Possible addressing modes:
  - register
  - register indirect
  - index
Slide 34: Condition codes (1)
• Problems in RISC with condition codes (CCs):
  - do instructions after reordering have access to the right CC values?
  - are CCs already available at the next instruction?
• Solutions:
  - compiler detection
  - no automatic use of CCs, only when explicitly given in the instruction
Slide 35: Explicit specification of CCs
Increment R5                  ADDI R5, R5, 1
Add R2, R4                    ADDC R4, R2, R4
Add-with-increment R1, R3     ADDE R3, R1, R3
Double-precision addition with PowerPC instructions (C: change carry flag, E: use carry flag).
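The ADDC/ADDE pattern, adding the low halves while recording the carry and then folding that carry into the high halves, can be mimicked in software. A hypothetical sketch, not from the slides, using 32-bit halves:

```python
# Hypothetical sketch (not from the slides): 64-bit addition built from
# two 32-bit additions linked by an explicit carry, mirroring the
# ADDC (set carry) / ADDE (use carry) PowerPC pair above.

MASK32 = 0xFFFFFFFF

def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values given as (hi, lo) pairs of 32-bit halves."""
    lo_sum = a_lo + b_lo                 # like ADDC: low halves, record carry
    carry = lo_sum >> 32
    hi_sum = a_hi + b_hi + carry         # like ADDE: high halves + carry in
    return hi_sum & MASK32, lo_sum & MASK32
```

Making the carry an explicit operand, instead of an implicit side effect, is exactly what lets the pipeline reorder surrounding instructions safely.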
Slide 36: Two execution units
[Figure: the fetch unit fills an instruction queue; a dispatch unit sends instructions to either the FP unit or the integer unit, whose results go to the write stage.]
Slide 37: Instruction flow (superscalar)
[Figure: I1 (Fadd) occupies the floating-point unit for three operation cycles while I2 (Add) completes in one; likewise I3 (Fsub) takes three operation cycles while I4 (Sub) takes one, so W4 occurs before W3. Caption: simultaneous execution of floating-point and integer operations.]
Slide 38: Completion in program order
[Figure: the same instruction flow as slide 37, but the write stage of each instruction waits until the previous instruction has completed, so results are written in program order.]
Slide 39: Consequences of completion order
When an exception occurs:
• writes not necessarily in the order of the instructions: imprecise exceptions
• writes in order: precise exceptions
Slide 40: PowerPC pipeline
[Figure: instruction fetch reads from the instruction cache and fills the instruction queue, assisted by a branch unit; a dispatcher issues instructions to the integer unit (IU), load/store unit (LSU), and floating-point unit (FPU); the LSU accesses the data cache through a store queue, and a completion queue retires instructions in order.]
Slide 41: Performance Effects (1)
• Execution time of a program: T
• Dynamic instruction count: N
• Number of cycles per instruction: S
• Clock rate: R
• Without pipelining: T = (N x S) / R
• With an n-stage pipeline: T' = T / n ???
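The two formulas can be written out directly; the question marks on the slide signal that T' = T / n is only the ideal upper bound, which the following slides then erode. A hypothetical sketch, not from the slides; the example numbers are chosen for illustration:

```python
# Hypothetical sketch (not from the slides): the two formulas of the slide.
# The example numbers (1e9 instructions, 4 cycles each, 500 MHz) are
# assumptions for illustration.

def exec_time(n_instr, cycles_per_instr, clock_hz):
    return n_instr * cycles_per_instr / clock_hz   # T = (N x S) / R

def ideal_pipelined_time(t, n_stages):
    return t / n_stages                            # T' = T / n (optimistic!)

t = exec_time(1e9, 4, 500e6)       # 8.0 s without pipelining
t4 = ideal_pipelined_time(t, 4)    # 2.0 s with an ideal 4-stage pipeline
```

The next slides show why the real T' is larger: stalls add extra cycles per instruction.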
Slide 42: Performance Effects (2)
• Cycle time: 2 ns (R is 500 MHz)
• Cache hit (miss) ratio for instructions: 0.95 (0.05)
• Cache hit (miss) ratio for data: 0.90 (0.10)
• Fraction of instructions that need data from memory: 0.30
• Cache miss penalty: 17 cycles
• Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so a slowdown by a factor of more than 2!!
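The slide's delay calculation, written out as code (a sketch of the same arithmetic, not an addition to the model):

```python
# The average-extra-delay calculation from the slide, written out.
# Misses per instruction = instruction-miss rate plus the data-miss rate
# weighted by the fraction of instructions that access memory.

def extra_delay(instr_miss, data_miss, mem_fraction, penalty):
    miss_rate = instr_miss + mem_fraction * data_miss
    return miss_rate * penalty

delay = extra_delay(0.05, 0.10, 0.30, 17)   # 1.36 extra cycles / instruction
slowdown = 1 + delay                         # 2.36 cycles instead of 1
```

Each instruction now takes 2.36 cycles on average instead of 1, hence the factor-of-more-than-2 slowdown.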
Slide 43: Performance Effects (3)
• On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles
• On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles
• For a total additional cost of 0.85 + 0.51 = 1.36 cycles
Slide 44: Performance Effects (4)
• If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction
• In other words: here, the pipeline is as slow as its slowest stage
[Figure: two overlapped F D O W sequences illustrating that a longer stage delays every instruction by the same amount.]
Slide 45: Performance Effects (5)
• A delay of 1 cycle every 4 instructions in only one stage: average penalty 0.25 cycles
• Average inter-completion time: (3 x 1 + 1 x 2) / 4 = 1.25 cycles
[Figure: instructions I1-I5 in the four-stage pipeline, one of which is delayed by a cycle.]
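The inter-completion average generalizes to any fraction of penalized instructions. A hypothetical sketch, not from the slides:

```python
# Hypothetical sketch (not from the slides): average inter-completion time
# when a fraction of instructions incurs a fixed penalty in one stage.
# Unpenalized instructions complete 1 cycle apart; penalized ones complete
# 1 + penalty cycles after their predecessor.

def avg_completion_time(penalized_fraction, penalty_cycles):
    return (1 - penalized_fraction) + penalized_fraction * (1 + penalty_cycles)

# Slide example: 1 instruction in 4 delayed by 1 cycle.
t = avg_completion_time(0.25, 1)   # (3*1 + 1*2)/4 = 1.25
```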
Slide 46: Performance Effects (6)
• Delays in two stages:
  - k % of the instructions in one stage, penalty s cycles
  - l % of the instructions in another stage, penalty t cycles
• Average inter-completion time: ((100 - k - l) x 1 + k(1 + s) + l(1 + t)) / 100 = (100 + ks + lt) / 100
• In the example (k = 5, l = 3, s = t = 17): 2.36
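The two-stage formula, written out with the cache example of slide 42, where k = 5 (instruction misses) and l = 3 (data misses: 30% of instructions times the 10% miss ratio):

```python
# The two-stage-delay formula (100 + k*s + l*t) / 100 from the slide.

def avg_completion_time(k, l, s, t):
    """k, l: percentages of delayed instructions; s, t: penalties (cycles)."""
    return ((100 - k - l) * 1 + k * (1 + s) + l * (1 + t)) / 100

# Cache example: k = 5, l = 3, s = t = 17 cycles.
cycles = avg_completion_time(5, 3, 17, 17)   # 2.36 cycles per instruction
```

This reproduces the 2.36 cycles per instruction of slide 42 exactly.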
Slide 47: Performance Effects (7)
• A large number of pipeline stages seems advantageous, but:
  - more instructions are simultaneously being processed, so there is more opportunity for conflicts
  - the branch penalty becomes larger
  - the ALU is usually the bottleneck, so there is no use in having smaller time steps