COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

64
1 COMP381 by M. Hamdi Improving Processor Performance with Pipelining Pipelining

Transcript of COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

1COMP381 by M. Hamdi

Improving Processor

Performance with

PipeliningPipelining

Introduction to PipeliningIntroduction to Pipelining• Pipelining: An implementation technique that overlaps the execution of

multiple instructions. It is a keykey technique in achieving high-performance

• Laundry ExampleLaundry Example• Ann, Brian, Cathy, Dave

each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

A B C D

Sequential LaundrySequential Laundry

• Sequential laundry takes 6 hours for 4 loads• If they learned pipelining, how long would laundry take?

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Pipelined LaundryPipelined LaundryStart work ASAPStart work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads • Speedup = 6/3.5 = 1.7

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20

5COMP381 by M. Hamdi

Pipelining Lessons• Latency vs. Throughput• Question

– What is the latency in both cases ?

– What is the throughput in both cases ?

Pipelining doesn’t help latency of single task,

It helps throughput of entire workload

A

B

C

D

30 40 40 40 40 20

6COMP381 by M. Hamdi

Pipelining Lessons [contd…]

• Question– What is the fastest operation in the example ?– What is the slowest operation in the example

Pipeline rate limited by slowest pipeline stage

A

B

C

D

30 40 40 40 40 20

7COMP381 by M. Hamdi

Pipelining Lessons [contd…]

A

B

C

D

30 40 40 40 40 20Multiple tasks operating simultaneously using

different resources

8COMP381 by M. Hamdi

Pipelining Lessons [contd…]

• Question

– Would the speedup increase if we had more steps ?

A

B

C

D

30 40 40 40 40 20

Potential Speedup = Number of pipe stages

9COMP381 by M. Hamdi

Pipelining Lessons [contd…]

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

• Question– Will it affect if “Folder” also took 40 minutes

Unbalanced lengths of pipe stages reduces speedup

10COMP381 by M. Hamdi

Pipelining Lessons [contd…]

A

B

C

D

30 40 40 40 40 20

Time to “fill” pipeline and time to “drain” it reduces speedup

11COMP381 by M. Hamdi

Pipelining a Digital System

• Key idea: break big computation up into pieces

Separate each piece with a pipeline register1ns

200ps 200ps 200ps 200ps 200ps

PipelineRegister

12COMP381 by M. Hamdi

Pipelining a Digital System

• Why do this? Because it's faster for repeated computations

1ns

Non-pipelined:1 operation finishesevery 1ns

200ps 200ps 200ps 200ps 200ps

Pipelined:1 operation finishesevery 200ps

13COMP381 by M. Hamdi

Comments about pipelining

• Pipelining increases throughput, but not latency

– Answer available every 200ps, BUT

– A single computation still takes 1ns

• Limitations:

– Computations must be divisible into stages of equal sizes

– Pipeline registers add overhead

14COMP381 by M. Hamdi

Another Example

Comb.Logic

REG

30ns 3ns

Clock

Delay = 33nsThroughput = 30MHz

Time

UnpipelinedSystem

Op1 Op2 Op3??

– One operation must complete before next can begin– Operations spaced 33ns apart

15COMP381 by M. Hamdi

3 Stage Pipelining

– Space operations 13ns apart

– 3 operations occur simultaneously

REG

Clock

Comb.Logic

REG

Comb.Logic

REG

Comb.Logic

10ns 3ns 10ns 3ns 10ns 3ns

Delay = 39nsThroughput = 77MHz

Time

Op1

Op2

Op3

Op4

16COMP381 by M. Hamdi

Limitation: Nonuniform Pipelining

Clock

REG

Com.Log.

REG

Comb.Logic

REG

Comb.Logic

5ns 3ns 15ns 3ns 10ns 3ns

Delay = 18 * 3 = 54 nsThroughput = 55MHz

• Throughput limited by slowest stage• Delay determined by clock period * number of stages

• Must attempt to balance stages

17COMP381 by M. Hamdi

Limitation: Deep Pipelines

• Diminishing returns as add more pipeline stages

• Register delays become limiting factor• Increased latency

• Small throughput gains

• More hazards

Delay = 48ns, Throughput = 128MHzClock

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

18COMP381 by M. Hamdi

Computer (Processor) PipeliningComputer (Processor) Pipelining

• It is one KEYKEY method of achieving High-Performance in modern microprocessors

• It is being used in many different designs (not just processors)– http://www.siliconstrategies.com/story/OEG20020820S0054

• It is a completely hardware mechanism

• A major advantage of pipelining over “parallel processing” is that it is not visiblenot visible to the programmer

• An instructioninstruction execution pipeline involves a number of steps, where each step completes a part of an instruction.

• Each step is called a pipe stage or a pipe segment.

19COMP381 by M. Hamdi

Instr 2

Pipelining• Multiple instructions overlapped in execution• Throughput optimization: doesn’t reduce time

for individual instructions

Instr 1

Stage 2Stage 3Stage 4Stage 5Stage 6Stage 7Stage 1

Instr 3Instr 2Instr 1

Stage 2Stage 3Stage 4Stage 5Stage 6Stage 7Stage 1

20COMP381 by M. Hamdi

Computer PipeliningComputer Pipelining

• The stages or steps are connected one to the next to form a pipe -- instructions enter at one end and progress through the stage and exit at the other end.

• ThroughputThroughput of an instruction pipeline is determined by how often an instruction exists the pipeline.

• The time to move an instruction one step down the line is equal to the machine cyclethe machine cycle (Clock Rate) and is determined by the stage with the longest processing delay (slowest pipeline stage).

21COMP381 by M. Hamdi

Pipelining: Design GoalsPipelining: Design Goals• An important pipeline design consideration is to

balance the length of each pipeline stage.

• If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls):

Time per instruction on unpipelined Time per instruction on unpipelined machinemachine

Number of pipe stagesNumber of pipe stages

• Under these ideal conditions:– Speedup from pipelining equals the number of pipeline

stages: n, – One instruction is completed every cycle, CPI = 1 .

22COMP381 by M. Hamdi

Pipelining: Design GoalsPipelining: Design Goals• Under these ideal conditions:

– Speedup from pipelining equals the number of pipeline stages: n,

– One instruction is completed every cycle, CPI = 1 .

– This is an asymptote of course, but +10% is commonly achieved

– Difference is due to difficulty in achieving balanced stage design

• Two ways to view the performance mechanism– Reduced CPI (i.e. non-piped to piped change)

• Close to 1 instruction/cycle if you’re lucky

– Reduced cycle-time (i.e. increasing pipeline depth)• Work split into more stages

• Simpler stages result in faster clock cycles

23COMP381 by M. Hamdi

Implementation of MIPSImplementation of MIPS

• We use the MIPS processor as an example to demonstrate the concepts of computer pipelining.

• MIPS ISA is designed based on sound measurements and sound architectural considerations (as covered in class).

• It is used by numerous companies (Nintendo and Playstation) through liscencing agreements.

• These same concepts are being used by ALL other processors as well.

24COMP381 by M. Hamdi

MIPS64 Instruction MIPS64 Instruction FormatFormat 166 5 5

ImmediatertrsOpcode

6 5 5 5 5

Opcode rs rt rd func

6 26

Opcode Offset added to PC

J - Type instruction

R - type instruction

Jump and jump and link. Trap and return from exception

Register-register ALU operations: rd rs func rt Function encodes the data path operation: Add, Sub .. Read/write special registers and moves.

Encodes: Loads and stores of bytes, words, half words. All immediates (rd rs op immediate)Conditional branch instructions (rs1 is register, rd unused)Jump register, jump and link register (rd = 0, rs = destination, immediate = 0)

I - type instruction

6

shamt

0 5 6 10 11 15 16 31

0 5 6 10 11 15 16 20 21 25 26 31

0 5 6 31

25COMP381 by M. Hamdi

A Basic Multi-Cycle A Basic Multi-Cycle Implementation of MIPSImplementation of MIPS

• Every integer MIPS instruction can be implemented in at most five clock cycles (branch – 2 cycles, Store – 4 cycles, other – 5 cycles):

1 Instruction fetch cycle (IF):

IR Mem[PC]NPC PC + 4

2 Instruction decode/register fetch cycle (ID):

A Regs[rs];B Regs[rt];

Imm ((IR16)16##IR 16..31) sign-extended immediate field of IR

Note: IR (instruction register), NPC (next sequential program counter register) A, B, Imm are temporary registers

26COMP381 by M. Hamdi

A Basic Implementation of MIPS A Basic Implementation of MIPS (continued)(continued)

3 Execution/Effective address cycle (EX):

– Memory reference:

ALUOutput A + Imm;

– Register-Register ALU instruction:

ALUOutput A op B;

– Register-Immediate ALU instruction:

ALUOutput A op Imm;

– Branch:

ALUOutput NPC + Imm;

Cond (A == 0)

27COMP381 by M. Hamdi

A Basic Implementation of MIPS (continued)A Basic Implementation of MIPS (continued)

4 Memory access/branch completion cycle (MEM):

– Memory reference:

LMD Mem[ALUOutput] orMem[ALUOutput] B;

– Branch:

if (cond) PC ALUOutput else PC NPC

Note: LMD (load memory data) register

28COMP381 by M. Hamdi

A Basic Implementation of MIPS A Basic Implementation of MIPS (continued)(continued)

5 Write-back cycle (WB):

– Register-Register ALU instruction:

Regs[rd] ALUOutput;

– Register-Immediate ALU instruction:

Regs[rt] ALUOutput;

– Load instruction:

Regs[rt] LMD;

Note: LMD (load memory data) register

29COMP381 by M. Hamdi

Basic MIPS Multi-Cycle Integer Datapath ImplementationBasic MIPS Multi-Cycle Integer Datapath Implementation

30COMP381 by M. Hamdi

Simple MIPS Pipelined Simple MIPS Pipelined Integer Instruction Integer Instruction

ProcessingProcessing Clock Number Time in clock cycles Instruction Number 1 2 3 4 5 6 7 8 9

Instruction I IF ID EX MEM WBInstruction I+1 IF ID EX MEM WBInstruction I+2 IF ID EX MEM WBInstruction I+3 IF ID EX MEM WBInstruction I +4 IF ID EX MEM WB

Time to fill the pipeline

MIPS Pipeline Stages:

IF = Instruction FetchID = Instruction DecodeEX = ExecutionMEM = Memory AccessWB = Write Back

First instruction, ICompleted

Last instruction, I+4 completed

31COMP381 by M. Hamdi

Pipelining The MIPS Processor

• There are 5 steps in instruction execution:

1. Instruction Fetch

2. Instruction Decode and Register Read

3. Execution operation or calculate address

4. Memory access

5. Write result into register

32COMP381 by M. Hamdi

Datapath for Instruction Fetch

Instruction <- MEM[PC]PC <- PC + 4

RDMemory

ADDR

PC

Instruction

4

ADD

33COMP381 by M. Hamdi

Datapath for R-Type Instructions

add rd, rs, rtR[rd] <- R[rs] + R[rt];

5 5 5

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

op rs rt rd functshamt

Operation

ALU Zero

Instruction

3

34COMP381 by M. Hamdi

Datapath for Load/Store Instructions

op rs rt offset/immediate

5 5

16

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

Operation

ALU

3

EXTND

16 32

Zero

RDWD

MemRead

MemoryADDR

MemWrite

5

lw rt, offset(rs)R[rt] <- MEM[R[rs] + s_extend(offset)];

35COMP381 by M. Hamdi

Datapath for Load/Store Instructions

op rs rt offset/immediate

5 5

16

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

Operation

ALU

3

EXTND

16 32

Zero

RDWD

MemRead

MemoryADDR

MemWrite

5

sw rt, offset(rs)MEM[R[rs] + sign_extend(offset)] <- R[rt]

36COMP381 by M. Hamdi

Datapath for Branch Instructions

beq rs, rt, offset

op rs rt offset/immediate

5 5

16

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

Operation

ALU

EXTND

16 32

Zero

ADD

<<2

PC +4 from instruction datapath

if (R[rs] == R[rt]) then PC <- PC+4 + s_extend(offset<<2)

37COMP381 by M. Hamdi

Single-Cycle Processor

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

IFInstruction Fetch

IDInstruction Decode

EXExecute/ Address Calc.

MEMMemory Access

WBWrite Back

38COMP381 by M. Hamdi

Pipelining - Key Idea

• Question: What happens if we break execution into multiple cycles?

• Answer: in the best case, we can start executing a new instruction on each clock cycle - this is pipelining

• Pipelining stages:– IF - Instruction Fetch– ID - Instruction Decode– EX - Execute / Address Calculation– MEM - Memory Access (read / write)– WB - Write Back (results into register file)

39COMP381 by M. Hamdi

Pipeline RegistersPipeline Registers• Pipeline registers are named with 2 stages (the

stages that the register is “between.”)

• ANY information needed in a later pipeline stage MUST be passed via a pipeline register

–Example:IF/ID register gets

•instruction

•PC+4

• No register is needed after WB. Results from the WB stage are already stored in the register file, which serves as a pipeline register between instructions.

40COMP381 by M. Hamdi

Basic Pipelined Processor

IF/ID

Pipeline Registers

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

ID/EX EX/MEM MEM/WB

41COMP381 by M. Hamdi

Single-Cycle vs. Pipelined Execution

Non-Pipelined0 200 400 600 800 1000 1200 1400 1600 1800

lw $1, 100($0) InstructionFetch

REGRD

ALU REGWR

MEM

lw $2, 200($0) InstructionFetch

REGRD

ALU REGWR

MEM

lw $3, 300($0) InstructionFetch

TimeInstructionOrder

800ps

800ps

800ps

Pipelined0 200 400 600 800 1000 1200 1400 1600

lw $1, 100($0) InstructionFetch

REGRD

ALU REGWR

MEM

lw $2, 200($0)

lw $3, 300($0)

TimeInstructionOrder

200psInstruction

FetchREGRD

ALU REGWR

MEM

InstructionFetch

REGRD

ALU REGWR

MEM

200ps

200ps 200ps 200ps 200ps 200ps

42COMP381 by M. Hamdi

Pipelined Example - Executing Multiple Instructions

• Consider the following instruction sequence:lw $r0, 10($r1)

sw $sr3, 20($r4)

add $r5, $r6, $r7

sub $r8, $r9, $r10

43COMP381 by M. Hamdi

Executing Multiple InstructionsClock Cycle 1

LW

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

44COMP381 by M. Hamdi

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

Executing Multiple InstructionsClock Cycle 2

LWSW

45COMP381 by M. Hamdi

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

Executing Multiple InstructionsClock Cycle 3

LWSWADD

46COMP381 by M. Hamdi

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

Executing Multiple InstructionsClock Cycle 4

LWSWADDSUB

47COMP381 by M. Hamdi

Executing Multiple InstructionsClock Cycle 5

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

LWSWADDSUB

48COMP381 by M. Hamdi

Executing Multiple InstructionsClock Cycle 6

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

SWADDSUB

49COMP381 by M. Hamdi

Executing Multiple InstructionsClock Cycle 7

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

ADDSUB

50COMP381 by M. Hamdi

Executing Multiple InstructionsClock Cycle 8

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

SUB

51COMP381 by M. Hamdi

Alternative View - Multicycle Diagram

IM REG ALU DM REGlw $r0, 10($r1)

sw $r3, 20($r4)

add $r5, $r6, $r7

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7

IM REG ALU DM REG

IM REG ALU DM REG

sub $r8, $r9, $r10 IM REG ALU DM REG

CC 8

52COMP381 by M. Hamdi

Pipelining: Design GoalsPipelining: Design Goals

• Two ways to view the performance mechanism

– Reduced CPI (i.e. non-piped to piped change)

• Close to 1 instruction/cycle if you’re lucky

– Reduced cycle-time (i.e. increasing pipeline depth)

• Work split into more stages

• Simpler stages result in faster clock cycles

53COMP381 by M. Hamdi

Pipelining Performance Pipelining Performance ExampleExample

• Example: For an unpipelined CPU: – Clock cycle = 1ns, 4 cycles for ALU operations and branches and 5 cycles for

memory operations with instruction frequencies of 40%, 20% and 40%, respectively.

– If pipelining adds 0.2 ns to the machine clock cycle then the speedup in instruction execution from pipelining is:

Non-pipelined Average instruction execution time = Clock cycle x Average CPI

= 1 ns x ((40% + 20%) x 4 + 40%x 5) = 1 ns x 4.4 = 4.4 ns

In the pipelined five implementation five stages are used with an average instruction execution time of: 1 ns + 0.2 ns = 1.2 ns

Speedup from pipelining = Instruction time unpipelined

Instruction time pipelined

= 4.4 ns / 1.2 ns = 3.7 times faster

54COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and Latency:Latency:

A More realistic ExamplesA More realistic Examples

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Consider the pipeline above with the indicateddelays. We want to know what is the pipelinethroughput and the pipeline latency.

Pipeline throughput: instructions completed per second.

Pipeline latency: how long does it take to execute a single instruction in the pipeline.

55COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Pipeline throughput: how often an instruction is completed.

)(10/1

4,10,5,4,5max/1

)(),(),(),(),(max/1

overheadregisterpipelineignoringnsinstr

nsnsnsnsnsinstr

WBlatMEMlatEXlatIDlatIFlatinstr

Pipeline latency: how long does it take to execute an instruction in the pipeline.

nsnsnsnsnsns

WBlatMEMlatEXlatIDlatIFlatL

28410545

)()()()()(

56COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Simply adding the latencies to compute the pipelinelatency, only would work for an isolated instruction

L(I5) = 43ns

IF MEMIDI1 L(I1) = 28nsEX WBMEMIDIFI2 L(I2) = 33nsEX WB

MEMIDIFI3 L(I3) = 38nsEX WBMEMIDIFI4 EX WB

We are in trouble! The latency is not constant.This happens because this is an unbalancedpipeline. The solution is to make every state

the same length as the longest one.

57COMP381 by M. Hamdi

Synchronous Pipeline Synchronous Pipeline Throughput and LatencyThroughput and Latency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

The slowest pipeline stage also limits the latency!!

IF MEMIDI1

L(I1) = L(I2) = L(I3) = L(I4) = 50ns

EX WBIF MEMIDI2 L(I2) = 50nsEX WB

IF MEMID EX WBIF MEMID EX

0 10 20 30 40 50 60

I3I4

58COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

How long does it take to execute (issue) 20000 instructionsin this pipeline? (disregard latency, bubbles caused bybranches, cache misses, hazards)

How long would it take using the same moduleswithout pipelining?

snsnsExecTime pipe 2002000001020000

snsnsExecTime pipenon 5605600002820000

59COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Thus the speedup that we got from the pipeline is:

8.2 200

560

s

s

ExecTime

ExecTimeSpeedup

pipe

pipenonpipe

How can we improve this pipeline design?

We need to reduce the unbalance to increasethe clock speed.

60COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM1 WB

5 ns 4 ns 5 ns 5 ns 4 ns

Now we have one more pipeline stage, but themaximum latency of a single stage is reduced in half.

MEM2

5 ns

nsinstr

nsnsnsnsnsnsinstr

WBlatMEMlatMEMlatEXlatIDlatIFlatinstrT

5/1

)4,5,5,5,4,5max(/1

)(),2(),1(),(),(),(max(/1

nsnsL 3056

The new latency for a single instruction is:

61COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM1 WB

5 ns 4 ns 5 ns 5 ns 4 ns

MEM2

5 ns

IF MEM1IDI1 EX WBMEM2IF MEM1IDI2 EX WBMEM2

IF MEM1IDI3 EX WBMEM2IF MEM1IDI4 EX WBMEM2

IF MEM1IDI5 EX WBMEM2IF MEM1IDI6 EX WBMEM2

IF MEM1IDI7 EX WBMEM2

62COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM1 WB

5 ns 4 ns 5 ns 5 ns 4 ns

MEM2

5 ns

snsnsExecTime pipe 100100000520000

How long does it take to execute 20000 instructionsin this pipeline? (disregard bubbles caused bybranches, cache misses, etc, for now)

Thus the speedup that we get from the pipeline is:

6.5 100

560

s

s

ExecTime

ExecTimeSpeedup

pipe

pipenonpipe

63COMP381 by M. Hamdi

Pipeline Throughput and Pipeline Throughput and LatencyLatency

IF ID EX MEM1 WB

5 ns 4 ns 5 ns 5 ns 4 ns

MEM2

5 ns

What have we learned from this example?

1. It is important to balance the delays in the stages of the pipeline

2. The throughput of a pipeline is 1/max(delay).

3. The latency is Nmax(delay), where N is the number of stages in the pipeline.

64COMP381 by M. Hamdi

Pipelining is Not That Easy for Pipelining is Not That Easy for ComputersComputers

• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.

– Data hazards: Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline

– Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC

• A possible solution is to “stall” the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline