COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.
-
Upload
eustacia-lambert -
Category
Documents
-
view
227 -
download
1
Transcript of COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.
Introduction to PipeliningIntroduction to Pipelining• Pipelining: An implementation technique that overlaps the execution of
multiple instructions. It is a keykey technique in achieving high-performance
• Laundry ExampleLaundry Example• Ann, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
A B C D
Sequential LaundrySequential Laundry
• Sequential laundry takes 6 hours for 4 loads• If they learned pipelining, how long would laundry take?
A
B
C
D
30 40 20 30 40 20 30 40 20 30 40 20
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
Pipelined LaundryPipelined LaundryStart work ASAPStart work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads • Speedup = 6/3.5 = 1.7
A
B
C
D
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
30 40 40 40 40 20
5COMP381 by M. Hamdi
Pipelining Lessons• Latency vs. Throughput• Question
– What is the latency in both cases ?
– What is the throughput in both cases ?
Pipelining doesn’t help latency of single task,
It helps throughput of entire workload
A
B
C
D
30 40 40 40 40 20
6COMP381 by M. Hamdi
Pipelining Lessons [contd…]
• Question– What is the fastest operation in the example ?– What is the slowest operation in the example
Pipeline rate limited by slowest pipeline stage
A
B
C
D
30 40 40 40 40 20
7COMP381 by M. Hamdi
Pipelining Lessons [contd…]
A
B
C
D
30 40 40 40 40 20Multiple tasks operating simultaneously using
different resources
8COMP381 by M. Hamdi
Pipelining Lessons [contd…]
• Question
– Would the speedup increase if we had more steps ?
A
B
C
D
30 40 40 40 40 20
Potential Speedup = Number of pipe stages
9COMP381 by M. Hamdi
Pipelining Lessons [contd…]
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
• Question– Will it affect if “Folder” also took 40 minutes
Unbalanced lengths of pipe stages reduces speedup
10COMP381 by M. Hamdi
Pipelining Lessons [contd…]
A
B
C
D
30 40 40 40 40 20
Time to “fill” pipeline and time to “drain” it reduces speedup
11COMP381 by M. Hamdi
Pipelining a Digital System
• Key idea: break big computation up into pieces
Separate each piece with a pipeline register1ns
200ps 200ps 200ps 200ps 200ps
PipelineRegister
12COMP381 by M. Hamdi
Pipelining a Digital System
• Why do this? Because it's faster for repeated computations
1ns
Non-pipelined:1 operation finishesevery 1ns
200ps 200ps 200ps 200ps 200ps
Pipelined:1 operation finishesevery 200ps
13COMP381 by M. Hamdi
Comments about pipelining
• Pipelining increases throughput, but not latency
– Answer available every 200ps, BUT
– A single computation still takes 1ns
• Limitations:
– Computations must be divisible into stages of equal sizes
– Pipeline registers add overhead
14COMP381 by M. Hamdi
Another Example
Comb.Logic
REG
30ns 3ns
Clock
Delay = 33nsThroughput = 30MHz
Time
UnpipelinedSystem
Op1 Op2 Op3??
– One operation must complete before next can begin– Operations spaced 33ns apart
15COMP381 by M. Hamdi
3 Stage Pipelining
– Space operations 13ns apart
– 3 operations occur simultaneously
REG
Clock
Comb.Logic
REG
Comb.Logic
REG
Comb.Logic
10ns 3ns 10ns 3ns 10ns 3ns
Delay = 39nsThroughput = 77MHz
Time
Op1
Op2
Op3
Op4
16COMP381 by M. Hamdi
Limitation: Nonuniform Pipelining
Clock
REG
Com.Log.
REG
Comb.Logic
REG
Comb.Logic
5ns 3ns 15ns 3ns 10ns 3ns
Delay = 18 * 3 = 54 nsThroughput = 55MHz
• Throughput limited by slowest stage• Delay determined by clock period * number of stages
• Must attempt to balance stages
17COMP381 by M. Hamdi
Limitation: Deep Pipelines
• Diminishing returns as add more pipeline stages
• Register delays become limiting factor• Increased latency
• Small throughput gains
• More hazards
Delay = 48ns, Throughput = 128MHzClock
REG
Com.Log.
5ns 3ns
REG
Com.Log.
5ns 3ns
REG
Com.Log.
5ns 3ns
REG
Com.Log.
5ns 3ns
REG
Com.Log.
5ns 3ns
REG
Com.Log.
5ns 3ns
18COMP381 by M. Hamdi
Computer (Processor) PipeliningComputer (Processor) Pipelining
• It is one KEYKEY method of achieving High-Performance in modern microprocessors
• It is being used in many different designs (not just processors)– http://www.siliconstrategies.com/story/OEG20020820S0054
• It is a completely hardware mechanism
• A major advantage of pipelining over “parallel processing” is that it is not visiblenot visible to the programmer
• An instructioninstruction execution pipeline involves a number of steps, where each step completes a part of an instruction.
• Each step is called a pipe stage or a pipe segment.
19COMP381 by M. Hamdi
Instr 2
Pipelining• Multiple instructions overlapped in execution• Throughput optimization: doesn’t reduce time
for individual instructions
Instr 1
Stage 2Stage 3Stage 4Stage 5Stage 6Stage 7Stage 1
Instr 3Instr 2Instr 1
Stage 2Stage 3Stage 4Stage 5Stage 6Stage 7Stage 1
20COMP381 by M. Hamdi
Computer PipeliningComputer Pipelining
• The stages or steps are connected one to the next to form a pipe -- instructions enter at one end and progress through the stage and exit at the other end.
• ThroughputThroughput of an instruction pipeline is determined by how often an instruction exists the pipeline.
• The time to move an instruction one step down the line is equal to the machine cyclethe machine cycle (Clock Rate) and is determined by the stage with the longest processing delay (slowest pipeline stage).
21COMP381 by M. Hamdi
Pipelining: Design GoalsPipelining: Design Goals• An important pipeline design consideration is to
balance the length of each pipeline stage.
• If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls):
Time per instruction on unpipelined Time per instruction on unpipelined machinemachine
Number of pipe stagesNumber of pipe stages
• Under these ideal conditions:– Speedup from pipelining equals the number of pipeline
stages: n, – One instruction is completed every cycle, CPI = 1 .
22COMP381 by M. Hamdi
Pipelining: Design GoalsPipelining: Design Goals• Under these ideal conditions:
– Speedup from pipelining equals the number of pipeline stages: n,
– One instruction is completed every cycle, CPI = 1 .
– This is an asymptote of course, but +10% is commonly achieved
– Difference is due to difficulty in achieving balanced stage design
• Two ways to view the performance mechanism– Reduced CPI (i.e. non-piped to piped change)
• Close to 1 instruction/cycle if you’re lucky
– Reduced cycle-time (i.e. increasing pipeline depth)• Work split into more stages
• Simpler stages result in faster clock cycles
23COMP381 by M. Hamdi
Implementation of MIPSImplementation of MIPS
• We use the MIPS processor as an example to demonstrate the concepts of computer pipelining.
• MIPS ISA is designed based on sound measurements and sound architectural considerations (as covered in class).
• It is used by numerous companies (Nintendo and Playstation) through liscencing agreements.
• These same concepts are being used by ALL other processors as well.
24COMP381 by M. Hamdi
MIPS64 Instruction MIPS64 Instruction FormatFormat 166 5 5
ImmediatertrsOpcode
6 5 5 5 5
Opcode rs rt rd func
6 26
Opcode Offset added to PC
J - Type instruction
R - type instruction
Jump and jump and link. Trap and return from exception
Register-register ALU operations: rd rs func rt Function encodes the data path operation: Add, Sub .. Read/write special registers and moves.
Encodes: Loads and stores of bytes, words, half words. All immediates (rd rs op immediate)Conditional branch instructions (rs1 is register, rd unused)Jump register, jump and link register (rd = 0, rs = destination, immediate = 0)
I - type instruction
6
shamt
0 5 6 10 11 15 16 31
0 5 6 10 11 15 16 20 21 25 26 31
0 5 6 31
25COMP381 by M. Hamdi
A Basic Multi-Cycle A Basic Multi-Cycle Implementation of MIPSImplementation of MIPS
• Every integer MIPS instruction can be implemented in at most five clock cycles (branch – 2 cycles, Store – 4 cycles, other – 5 cycles):
1 Instruction fetch cycle (IF):
IR Mem[PC]NPC PC + 4
2 Instruction decode/register fetch cycle (ID):
A Regs[rs];B Regs[rt];
Imm ((IR16)16##IR 16..31) sign-extended immediate field of IR
Note: IR (instruction register), NPC (next sequential program counter register) A, B, Imm are temporary registers
26COMP381 by M. Hamdi
A Basic Implementation of MIPS A Basic Implementation of MIPS (continued)(continued)
3 Execution/Effective address cycle (EX):
– Memory reference:
ALUOutput A + Imm;
– Register-Register ALU instruction:
ALUOutput A op B;
– Register-Immediate ALU instruction:
ALUOutput A op Imm;
– Branch:
ALUOutput NPC + Imm;
Cond (A == 0)
27COMP381 by M. Hamdi
A Basic Implementation of MIPS (continued)A Basic Implementation of MIPS (continued)
4 Memory access/branch completion cycle (MEM):
– Memory reference:
LMD Mem[ALUOutput] orMem[ALUOutput] B;
– Branch:
if (cond) PC ALUOutput else PC NPC
Note: LMD (load memory data) register
28COMP381 by M. Hamdi
A Basic Implementation of MIPS A Basic Implementation of MIPS (continued)(continued)
5 Write-back cycle (WB):
– Register-Register ALU instruction:
Regs[rd] ALUOutput;
– Register-Immediate ALU instruction:
Regs[rt] ALUOutput;
– Load instruction:
Regs[rt] LMD;
Note: LMD (load memory data) register
29COMP381 by M. Hamdi
Basic MIPS Multi-Cycle Integer Datapath ImplementationBasic MIPS Multi-Cycle Integer Datapath Implementation
30COMP381 by M. Hamdi
Simple MIPS Pipelined Simple MIPS Pipelined Integer Instruction Integer Instruction
ProcessingProcessing Clock Number Time in clock cycles Instruction Number 1 2 3 4 5 6 7 8 9
Instruction I IF ID EX MEM WBInstruction I+1 IF ID EX MEM WBInstruction I+2 IF ID EX MEM WBInstruction I+3 IF ID EX MEM WBInstruction I +4 IF ID EX MEM WB
Time to fill the pipeline
MIPS Pipeline Stages:
IF = Instruction FetchID = Instruction DecodeEX = ExecutionMEM = Memory AccessWB = Write Back
First instruction, ICompleted
Last instruction, I+4 completed
31COMP381 by M. Hamdi
Pipelining The MIPS Processor
• There are 5 steps in instruction execution:
1. Instruction Fetch
2. Instruction Decode and Register Read
3. Execution operation or calculate address
4. Memory access
5. Write result into register
32COMP381 by M. Hamdi
Datapath for Instruction Fetch
Instruction <- MEM[PC]PC <- PC + 4
RDMemory
ADDR
PC
Instruction
4
ADD
33COMP381 by M. Hamdi
Datapath for R-Type Instructions
add rd, rs, rtR[rd] <- R[rs] + R[rt];
5 5 5
RD1
RD2
RN1 RN2 WN
WD
RegWrite
Register File
op rs rt rd functshamt
Operation
ALU Zero
Instruction
3
34COMP381 by M. Hamdi
Datapath for Load/Store Instructions
op rs rt offset/immediate
5 5
16
RD1
RD2
RN1 RN2 WN
WD
RegWrite
Register File
Operation
ALU
3
EXTND
16 32
Zero
RDWD
MemRead
MemoryADDR
MemWrite
5
lw rt, offset(rs)R[rt] <- MEM[R[rs] + s_extend(offset)];
35COMP381 by M. Hamdi
Datapath for Load/Store Instructions
op rs rt offset/immediate
5 5
16
RD1
RD2
RN1 RN2 WN
WD
RegWrite
Register File
Operation
ALU
3
EXTND
16 32
Zero
RDWD
MemRead
MemoryADDR
MemWrite
5
sw rt, offset(rs)MEM[R[rs] + sign_extend(offset)] <- R[rt]
36COMP381 by M. Hamdi
Datapath for Branch Instructions
beq rs, rt, offset
op rs rt offset/immediate
5 5
16
RD1
RD2
RN1 RN2 WN
WD
RegWrite
Register File
Operation
ALU
EXTND
16 32
Zero
ADD
<<2
PC +4 from instruction datapath
if (R[rs] == R[rt]) then PC <- PC+4 + s_extend(offset<<2)
37COMP381 by M. Hamdi
Single-Cycle Processor
5 516
RD1
RD2
RN1 RN2 WN
WD
Register File ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
5
Instruction I32
MUX
<<2RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
32
IFInstruction Fetch
IDInstruction Decode
EXExecute/ Address Calc.
MEMMemory Access
WBWrite Back
38COMP381 by M. Hamdi
Pipelining - Key Idea
• Question: What happens if we break execution into multiple cycles?
• Answer: in the best case, we can start executing a new instruction on each clock cycle - this is pipelining
• Pipelining stages:– IF - Instruction Fetch– ID - Instruction Decode– EX - Execute / Address Calculation– MEM - Memory Access (read / write)– WB - Write Back (results into register file)
39COMP381 by M. Hamdi
Pipeline RegistersPipeline Registers• Pipeline registers are named with 2 stages (the
stages that the register is “between.”)
• ANY information needed in a later pipeline stage MUST be passed via a pipeline register
–Example:IF/ID register gets
•instruction
•PC+4
• No register is needed after WB. Results from the WB stage are already stored in the register file, which serves as a pipeline register between instructions.
40COMP381 by M. Hamdi
Basic Pipelined Processor
IF/ID
Pipeline Registers
5 516
RD1
RD2
RN1 RN2 WN
WD
Register File ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
5
Instruction I32
MUX
<<2RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
32
ID/EX EX/MEM MEM/WB
41COMP381 by M. Hamdi
Single-Cycle vs. Pipelined Execution
Non-Pipelined0 200 400 600 800 1000 1200 1400 1600 1800
lw $1, 100($0) InstructionFetch
REGRD
ALU REGWR
MEM
lw $2, 200($0) InstructionFetch
REGRD
ALU REGWR
MEM
lw $3, 300($0) InstructionFetch
TimeInstructionOrder
800ps
800ps
800ps
Pipelined0 200 400 600 800 1000 1200 1400 1600
lw $1, 100($0) InstructionFetch
REGRD
ALU REGWR
MEM
lw $2, 200($0)
lw $3, 300($0)
TimeInstructionOrder
200psInstruction
FetchREGRD
ALU REGWR
MEM
InstructionFetch
REGRD
ALU REGWR
MEM
200ps
200ps 200ps 200ps 200ps 200ps
42COMP381 by M. Hamdi
Pipelined Example - Executing Multiple Instructions
• Consider the following instruction sequence:lw $r0, 10($r1)
sw $sr3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
43COMP381 by M. Hamdi
Executing Multiple InstructionsClock Cycle 1
LW
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
44COMP381 by M. Hamdi
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
Executing Multiple InstructionsClock Cycle 2
LWSW
45COMP381 by M. Hamdi
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
Executing Multiple InstructionsClock Cycle 3
LWSWADD
46COMP381 by M. Hamdi
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
Executing Multiple InstructionsClock Cycle 4
LWSWADDSUB
47COMP381 by M. Hamdi
Executing Multiple InstructionsClock Cycle 5
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
LWSWADDSUB
48COMP381 by M. Hamdi
Executing Multiple InstructionsClock Cycle 6
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
SWADDSUB
49COMP381 by M. Hamdi
Executing Multiple InstructionsClock Cycle 7
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
ADDSUB
50COMP381 by M. Hamdi
Executing Multiple InstructionsClock Cycle 8
5
RD1
RD2
RN1
RN2
WN
WD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
SUB
51COMP381 by M. Hamdi
Alternative View - Multicycle Diagram
IM REG ALU DM REGlw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7
IM REG ALU DM REG
IM REG ALU DM REG
sub $r8, $r9, $r10 IM REG ALU DM REG
CC 8
52COMP381 by M. Hamdi
Pipelining: Design GoalsPipelining: Design Goals
• Two ways to view the performance mechanism
– Reduced CPI (i.e. non-piped to piped change)
• Close to 1 instruction/cycle if you’re lucky
– Reduced cycle-time (i.e. increasing pipeline depth)
• Work split into more stages
• Simpler stages result in faster clock cycles
53COMP381 by M. Hamdi
Pipelining Performance Pipelining Performance ExampleExample
• Example: For an unpipelined CPU: – Clock cycle = 1ns, 4 cycles for ALU operations and branches and 5 cycles for
memory operations with instruction frequencies of 40%, 20% and 40%, respectively.
– If pipelining adds 0.2 ns to the machine clock cycle then the speedup in instruction execution from pipelining is:
Non-pipelined Average instruction execution time = Clock cycle x Average CPI
= 1 ns x ((40% + 20%) x 4 + 40%x 5) = 1 ns x 4.4 = 4.4 ns
In the pipelined five implementation five stages are used with an average instruction execution time of: 1 ns + 0.2 ns = 1.2 ns
Speedup from pipelining = Instruction time unpipelined
Instruction time pipelined
= 4.4 ns / 1.2 ns = 3.7 times faster
54COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and Latency:Latency:
A More realistic ExamplesA More realistic Examples
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Consider the pipeline above with the indicateddelays. We want to know what is the pipelinethroughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long does it take to execute a single instruction in the pipeline.
55COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Pipeline throughput: how often an instruction is completed.
)(10/1
4,10,5,4,5max/1
)(),(),(),(),(max/1
overheadregisterpipelineignoringnsinstr
nsnsnsnsnsinstr
WBlatMEMlatEXlatIDlatIFlatinstr
Pipeline latency: how long does it take to execute an instruction in the pipeline.
nsnsnsnsnsns
WBlatMEMlatEXlatIDlatIFlatL
28410545
)()()()()(
56COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Simply adding the latencies to compute the pipelinelatency, only would work for an isolated instruction
L(I5) = 43ns
IF MEMIDI1 L(I1) = 28nsEX WBMEMIDIFI2 L(I2) = 33nsEX WB
MEMIDIFI3 L(I3) = 38nsEX WBMEMIDIFI4 EX WB
We are in trouble! The latency is not constant.This happens because this is an unbalancedpipeline. The solution is to make every state
the same length as the longest one.
57COMP381 by M. Hamdi
Synchronous Pipeline Synchronous Pipeline Throughput and LatencyThroughput and Latency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
The slowest pipeline stage also limits the latency!!
IF MEMIDI1
L(I1) = L(I2) = L(I3) = L(I4) = 50ns
EX WBIF MEMIDI2 L(I2) = 50nsEX WB
IF MEMID EX WBIF MEMID EX
0 10 20 30 40 50 60
I3I4
58COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
How long does it take to execute (issue) 20000 instructionsin this pipeline? (disregard latency, bubbles caused bybranches, cache misses, hazards)
How long would it take using the same moduleswithout pipelining?
snsnsExecTime pipe 2002000001020000
snsnsExecTime pipenon 5605600002820000
59COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Thus the speedup that we got from the pipeline is:
8.2 200
560
s
s
ExecTime
ExecTimeSpeedup
pipe
pipenonpipe
How can we improve this pipeline design?
We need to reduce the unbalance to increasethe clock speed.
60COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM1 WB
5 ns 4 ns 5 ns 5 ns 4 ns
Now we have one more pipeline stage, but themaximum latency of a single stage is reduced in half.
MEM2
5 ns
nsinstr
nsnsnsnsnsnsinstr
WBlatMEMlatMEMlatEXlatIDlatIFlatinstrT
5/1
)4,5,5,5,4,5max(/1
)(),2(),1(),(),(),(max(/1
nsnsL 3056
The new latency for a single instruction is:
61COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM1 WB
5 ns 4 ns 5 ns 5 ns 4 ns
MEM2
5 ns
IF MEM1IDI1 EX WBMEM2IF MEM1IDI2 EX WBMEM2
IF MEM1IDI3 EX WBMEM2IF MEM1IDI4 EX WBMEM2
IF MEM1IDI5 EX WBMEM2IF MEM1IDI6 EX WBMEM2
IF MEM1IDI7 EX WBMEM2
62COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM1 WB
5 ns 4 ns 5 ns 5 ns 4 ns
MEM2
5 ns
snsnsExecTime pipe 100100000520000
How long does it take to execute 20000 instructionsin this pipeline? (disregard bubbles caused bybranches, cache misses, etc, for now)
Thus the speedup that we get from the pipeline is:
6.5 100
560
s
s
ExecTime
ExecTimeSpeedup
pipe
pipenonpipe
63COMP381 by M. Hamdi
Pipeline Throughput and Pipeline Throughput and LatencyLatency
IF ID EX MEM1 WB
5 ns 4 ns 5 ns 5 ns 4 ns
MEM2
5 ns
What have we learned from this example?
1. It is important to balance the delays in the stages of the pipeline
2. The throughput of a pipeline is 1/max(delay).
3. The latency is Nmax(delay), where N is the number of stages in the pipeline.
64COMP381 by M. Hamdi
Pipelining is Not That Easy for Pipelining is Not That Easy for ComputersComputers
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
– Data hazards: Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
– Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC
• A possible solution is to “stall” the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline