COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

1COMP381 by M. Hamdi

Improving Processor

Performance with

PipeliningPipelining

Introduction to PipeliningIntroduction to Pipelining• Pipelining: An implementation technique that overlaps the execution of

multiple instructions. It is a keykey technique in achieving high-performance

• Laundry ExampleLaundry Example• Ann, Brian, Cathy, Dave

each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

A B C D

Sequential LaundrySequential Laundry

• Sequential laundry takes 6 hours for 4 loads• If they learned pipelining, how long would laundry take?

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Pipelined LaundryPipelined LaundryStart work ASAPStart work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads • Speedup = 6/3.5 = 1.7

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20


Pipelining Lessons• Latency vs. Throughput• Question

– What is the latency in both cases ?

– What is the throughput in both cases ?

Pipelining doesn’t help latency of single task,

It helps throughput of entire workload

A

B

C

D

30 40 40 40 40 20


Pipelining Lessons [contd…]

• Question– What is the fastest operation in the example ?– What is the slowest operation in the example

Pipeline rate limited by slowest pipeline stage

A

B

C

D

30 40 40 40 40 20



A

B

C

D

30 40 40 40 40 20Multiple tasks operating simultaneously using

different resources



• Question

– Would the speedup increase if we had more steps ?

A

B

C

D

30 40 40 40 40 20

Potential Speedup = Number of pipe stages



• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

• Question– Will it affect if “Folder” also took 40 minutes

Unbalanced lengths of pipe stages reduces speedup



A

B

C

D

30 40 40 40 40 20

Time to “fill” pipeline and time to “drain” it reduces speedup


Pipelining a Digital System

• Key idea: break big computation up into pieces

Separate each piece with a pipeline register1ns

200ps 200ps 200ps 200ps 200ps

PipelineRegister


Pipelining a Digital System

• Why do this? Because it's faster for repeated computations

1ns

Non-pipelined:1 operation finishesevery 1ns

200ps 200ps 200ps 200ps 200ps

Pipelined:1 operation finishesevery 200ps


Comments about pipelining

• Pipelining increases throughput, but not latency

– Answer available every 200ps, BUT

– A single computation still takes 1ns

• Limitations:

– Computations must be divisible into stages of equal sizes

– Pipeline registers add overhead


Another Example

Comb.Logic

REG

30ns 3ns

Clock

Delay = 33nsThroughput = 30MHz

Time

UnpipelinedSystem

Op1 Op2 Op3??

– One operation must complete before next can begin– Operations spaced 33ns apart


3 Stage Pipelining

– Space operations 13ns apart

– 3 operations occur simultaneously

REG

Clock

Comb.Logic

REG

Comb.Logic

REG

Comb.Logic

10ns 3ns 10ns 3ns 10ns 3ns

Delay = 39nsThroughput = 77MHz

Time

Op1

Op2

Op3

Op4


Limitation: Nonuniform Pipelining

Clock

REG

Com.Log.

REG

Comb.Logic

REG

Comb.Logic

5ns 3ns 15ns 3ns 10ns 3ns

Delay = 18 * 3 = 54 nsThroughput = 55MHz

• Throughput limited by slowest stage• Delay determined by clock period * number of stages

• Must attempt to balance stages


Limitation: Deep Pipelines

• Diminishing returns as add more pipeline stages

• Register delays become limiting factor• Increased latency

• Small throughput gains

• More hazards

Delay = 48ns, Throughput = 128MHzClock

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns

REG

Com.Log.

5ns 3ns


Computer (Processor) PipeliningComputer (Processor) Pipelining

• It is one KEYKEY method of achieving High-Performance in modern microprocessors

• It is being used in many different designs (not just processors)– http://www.siliconstrategies.com/story/OEG20020820S0054

• It is a completely hardware mechanism

• A major advantage of pipelining over “parallel processing” is that it is not visiblenot visible to the programmer

• An instructioninstruction execution pipeline involves a number of steps, where each step completes a part of an instruction.

• Each step is called a pipe stage or a pipe segment.


Instr 2

Pipelining• Multiple instructions overlapped in execution• Throughput optimization: doesn’t reduce time

for individual instructions

Instr 1

Stage 2Stage 3Stage 4Stage 5Stage 6Stage 7Stage 1

Instr 3Instr 2Instr 1

Stage 2Stage 3Stage 4Stage 5Stage 6Stage 7Stage 1


Computer PipeliningComputer Pipelining

• The stages or steps are connected one to the next to form a pipe -- instructions enter at one end and progress through the stage and exit at the other end.

• ThroughputThroughput of an instruction pipeline is determined by how often an instruction exists the pipeline.

• The time to move an instruction one step down the line is equal to the machine cyclethe machine cycle (Clock Rate) and is determined by the stage with the longest processing delay (slowest pipeline stage).


Pipelining: Design GoalsPipelining: Design Goals• An important pipeline design consideration is to

balance the length of each pipeline stage.

• If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls):

Time per instruction on unpipelined Time per instruction on unpipelined machinemachine

Number of pipe stagesNumber of pipe stages

• Under these ideal conditions:– Speedup from pipelining equals the number of pipeline

stages: n, – One instruction is completed every cycle, CPI = 1 .


Pipelining: Design GoalsPipelining: Design Goals• Under these ideal conditions:

– Speedup from pipelining equals the number of pipeline stages: n,

– One instruction is completed every cycle, CPI = 1 .

– This is an asymptote of course, but +10% is commonly achieved

– Difference is due to difficulty in achieving balanced stage design

• Two ways to view the performance mechanism– Reduced CPI (i.e. non-piped to piped change)

• Close to 1 instruction/cycle if you’re lucky

– Reduced cycle-time (i.e. increasing pipeline depth)• Work split into more stages

• Simpler stages result in faster clock cycles


Implementation of MIPSImplementation of MIPS

• We use the MIPS processor as an example to demonstrate the concepts of computer pipelining.

• MIPS ISA is designed based on sound measurements and sound architectural considerations (as covered in class).

• It is used by numerous companies (Nintendo and Playstation) through liscencing agreements.

• These same concepts are being used by ALL other processors as well.


MIPS64 Instruction MIPS64 Instruction FormatFormat 166 5 5

ImmediatertrsOpcode

6 5 5 5 5

Opcode rs rt rd func

6 26

Opcode Offset added to PC

J - Type instruction

R - type instruction

Jump and jump and link. Trap and return from exception

Register-register ALU operations: rd rs func rt Function encodes the data path operation: Add, Sub .. Read/write special registers and moves.

Encodes: Loads and stores of bytes, words, half words. All immediates (rd rs op immediate)Conditional branch instructions (rs1 is register, rd unused)Jump register, jump and link register (rd = 0, rs = destination, immediate = 0)

I - type instruction

6

shamt

0 5 6 10 11 15 16 31

0 5 6 10 11 15 16 20 21 25 26 31

0 5 6 31


A Basic Multi-Cycle A Basic Multi-Cycle Implementation of MIPSImplementation of MIPS

• Every integer MIPS instruction can be implemented in at most five clock cycles (branch – 2 cycles, Store – 4 cycles, other – 5 cycles):

1 Instruction fetch cycle (IF):

IR Mem[PC]NPC PC + 4

2 Instruction decode/register fetch cycle (ID):

A Regs[rs];B Regs[rt];

Imm ((IR16)16##IR 16..31) sign-extended immediate field of IR

Note: IR (instruction register), NPC (next sequential program counter register) A, B, Imm are temporary registers


A Basic Implementation of MIPS A Basic Implementation of MIPS (continued)(continued)

3 Execution/Effective address cycle (EX):

– Memory reference:

ALUOutput A + Imm;

– Register-Register ALU instruction:

ALUOutput A op B;

– Register-Immediate ALU instruction:

ALUOutput A op Imm;

– Branch:

ALUOutput NPC + Imm;

Cond (A == 0)


A Basic Implementation of MIPS (continued)A Basic Implementation of MIPS (continued)

4 Memory access/branch completion cycle (MEM):

– Memory reference:

LMD Mem[ALUOutput] orMem[ALUOutput] B;

– Branch:

if (cond) PC ALUOutput else PC NPC

Note: LMD (load memory data) register


A Basic Implementation of MIPS A Basic Implementation of MIPS (continued)(continued)

5 Write-back cycle (WB):

– Register-Register ALU instruction:

Regs[rd] ALUOutput;

– Register-Immediate ALU instruction:

Regs[rt] ALUOutput;

– Load instruction:

Regs[rt] LMD;

Note: LMD (load memory data) register


Basic MIPS Multi-Cycle Integer Datapath ImplementationBasic MIPS Multi-Cycle Integer Datapath Implementation


Simple MIPS Pipelined Simple MIPS Pipelined Integer Instruction Integer Instruction

ProcessingProcessing Clock Number Time in clock cycles Instruction Number 1 2 3 4 5 6 7 8 9

Instruction I IF ID EX MEM WBInstruction I+1 IF ID EX MEM WBInstruction I+2 IF ID EX MEM WBInstruction I+3 IF ID EX MEM WBInstruction I +4 IF ID EX MEM WB

Time to fill the pipeline

MIPS Pipeline Stages:

IF = Instruction FetchID = Instruction DecodeEX = ExecutionMEM = Memory AccessWB = Write Back

First instruction, ICompleted

Last instruction, I+4 completed


Pipelining The MIPS Processor

• There are 5 steps in instruction execution:

1. Instruction Fetch

2. Instruction Decode and Register Read

3. Execution operation or calculate address

4. Memory access

5. Write result into register


Datapath for Instruction Fetch

Instruction <- MEM[PC]PC <- PC + 4

RDMemory

ADDR

PC

Instruction

4

ADD


Datapath for R-Type Instructions

add rd, rs, rtR[rd] <- R[rs] + R[rt];

5 5 5

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

op rs rt rd functshamt

Operation

ALU Zero

Instruction

3


Datapath for Load/Store Instructions

op rs rt offset/immediate

5 5

16

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

Operation

ALU

3

EXTND

16 32

Zero

RDWD

MemRead

MemoryADDR

MemWrite

5

lw rt, offset(rs)R[rt] <- MEM[R[rs] + s_extend(offset)];


Datapath for Load/Store Instructions


5 5

16

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

Operation

ALU

3

EXTND

16 32

Zero

RDWD

MemRead

MemoryADDR

MemWrite

5

sw rt, offset(rs)MEM[R[rs] + sign_extend(offset)] <- R[rt]


Datapath for Branch Instructions

beq rs, rt, offset


5 5

16

RD1

RD2

RN1 RN2 WN

WD

RegWrite

Register File

Operation

ALU

EXTND

16 32

Zero

ADD

<<2

PC +4 from instruction datapath

if (R[rs] == R[rt]) then PC <- PC+4 + s_extend(offset<<2)


Single-Cycle Processor

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

IFInstruction Fetch

IDInstruction Decode

EXExecute/ Address Calc.

MEMMemory Access

WBWrite Back


Pipelining - Key Idea

• Question: What happens if we break execution into multiple cycles?

• Answer: in the best case, we can start executing a new instruction on each clock cycle - this is pipelining

• Pipelining stages:– IF - Instruction Fetch– ID - Instruction Decode– EX - Execute / Address Calculation– MEM - Memory Access (read / write)– WB - Write Back (results into register file)


Pipeline RegistersPipeline Registers• Pipeline registers are named with 2 stages (the

stages that the register is “between.”)

• ANY information needed in a later pipeline stage MUST be passed via a pipeline register

–Example:IF/ID register gets

•instruction

•PC+4

• No register is needed after WB. Results from the WB stage are already stored in the register file, which serves as a pipeline register between instructions.


Basic Pipelined Processor

IF/ID

Pipeline Registers

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

ID/EX EX/MEM MEM/WB


Single-Cycle vs. Pipelined Execution

Non-Pipelined0 200 400 600 800 1000 1200 1400 1600 1800

lw $1, 100($0) InstructionFetch

REGRD

ALU REGWR

MEM


REGRD

ALU REGWR

MEM


TimeInstructionOrder

800ps

800ps

800ps

Pipelined0 200 400 600 800 1000 1200 1400 1600


REGRD

ALU REGWR

MEM

lw $2, 200($0)

lw $3, 300($0)

TimeInstructionOrder

200psInstruction

FetchREGRD

ALU REGWR

MEM

InstructionFetch

REGRD

ALU REGWR

MEM

200ps

200ps 200ps 200ps 200ps 200ps


Pipelined Example - Executing Multiple Instructions

• Consider the following instruction sequence:lw $r0, 10($r1)

sw $sr3, 20($r4)

add $r5, $r6, $r7

sub $r8, $r9, $r10


Executing Multiple InstructionsClock Cycle 1

LW

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero


5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero


LWSW


5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero


LWSWADD


5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero


LWSWADDSUB



5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

LWSWADDSUB



5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

SWADDSUB



5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

ADDSUB



5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

SUB


Alternative View - Multicycle Diagram

IM REG ALU DM REGlw $r0, 10($r1)

sw $r3, 20($r4)

add $r5, $r6, $r7

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7

IM REG ALU DM REG

IM REG ALU DM REG

sub $r8, $r9, $r10 IM REG ALU DM REG

CC 8


Pipelining: Design GoalsPipelining: Design Goals

• Two ways to view the performance mechanism

– Reduced CPI (i.e. non-piped to piped change)

• Close to 1 instruction/cycle if you’re lucky

– Reduced cycle-time (i.e. increasing pipeline depth)

• Work split into more stages

• Simpler stages result in faster clock cycles


Pipelining Performance Pipelining Performance ExampleExample

• Example: For an unpipelined CPU: – Clock cycle = 1ns, 4 cycles for ALU operations and branches and 5 cycles for

memory operations with instruction frequencies of 40%, 20% and 40%, respectively.

– If pipelining adds 0.2 ns to the machine clock cycle then the speedup in instruction execution from pipelining is:

Non-pipelined Average instruction execution time = Clock cycle x Average CPI

= 1 ns x ((40% + 20%) x 4 + 40%x 5) = 1 ns x 4.4 = 4.4 ns

In the pipelined five implementation five stages are used with an average instruction execution time of: 1 ns + 0.2 ns = 1.2 ns

Speedup from pipelining = Instruction time unpipelined

Instruction time pipelined

= 4.4 ns / 1.2 ns = 3.7 times faster


Pipeline Throughput and Pipeline Throughput and Latency:Latency:

A More realistic ExamplesA More realistic Examples

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Consider the pipeline above with the indicateddelays. We want to know what is the pipelinethroughput and the pipeline latency.

Pipeline throughput: instructions completed per second.

Pipeline latency: how long does it take to execute a single instruction in the pipeline.


Pipeline Throughput and Pipeline Throughput and LatencyLatency


Pipeline throughput: how often an instruction is completed.

)(10/1

4,10,5,4,5max/1

)(),(),(),(),(max/1

overheadregisterpipelineignoringnsinstr

nsnsnsnsnsinstr

WBlatMEMlatEXlatIDlatIFlatinstr

Pipeline latency: how long does it take to execute an instruction in the pipeline.

nsnsnsnsnsns

WBlatMEMlatEXlatIDlatIFlatL

28410545

)()()()()(




Simply adding the latencies to compute the pipelinelatency, only would work for an isolated instruction

L(I5) = 43ns

IF MEMIDI1 L(I1) = 28nsEX WBMEMIDIFI2 L(I2) = 33nsEX WB

MEMIDIFI3 L(I3) = 38nsEX WBMEMIDIFI4 EX WB

We are in trouble! The latency is not constant.This happens because this is an unbalancedpipeline. The solution is to make every state

the same length as the longest one.


Synchronous Pipeline Synchronous Pipeline Throughput and LatencyThroughput and Latency


The slowest pipeline stage also limits the latency!!

IF MEMIDI1

L(I1) = L(I2) = L(I3) = L(I4) = 50ns

EX WBIF MEMIDI2 L(I2) = 50nsEX WB

IF MEMID EX WBIF MEMID EX

0 10 20 30 40 50 60

I3I4




How long does it take to execute (issue) 20000 instructionsin this pipeline? (disregard latency, bubbles caused bybranches, cache misses, hazards)

How long would it take using the same moduleswithout pipelining?

snsnsExecTime pipe 2002000001020000

snsnsExecTime pipenon 5605600002820000




Thus the speedup that we got from the pipeline is:

8.2 200

560

s

s

ExecTime

ExecTimeSpeedup

pipe

pipenonpipe

How can we improve this pipeline design?

We need to reduce the unbalance to increasethe clock speed.



IF ID EX MEM1 WB

5 ns 4 ns 5 ns 5 ns 4 ns

Now we have one more pipeline stage, but themaximum latency of a single stage is reduced in half.

MEM2

5 ns

nsinstr

nsnsnsnsnsnsinstr

WBlatMEMlatMEMlatEXlatIDlatIFlatinstrT

5/1

)4,5,5,5,4,5max(/1

)(),2(),1(),(),(),(max(/1

nsnsL 3056

The new latency for a single instruction is:



IF ID EX MEM1 WB


MEM2

5 ns

IF MEM1IDI1 EX WBMEM2IF MEM1IDI2 EX WBMEM2



IF MEM1IDI7 EX WBMEM2



IF ID EX MEM1 WB


MEM2

5 ns

snsnsExecTime pipe 100100000520000

How long does it take to execute 20000 instructionsin this pipeline? (disregard bubbles caused bybranches, cache misses, etc, for now)

Thus the speedup that we get from the pipeline is:

6.5 100

560

s

s

ExecTime

ExecTimeSpeedup

pipe

pipenonpipe



IF ID EX MEM1 WB


MEM2

5 ns

What have we learned from this example?

1. It is important to balance the delays in the stages of the pipeline

2. The throughput of a pipeline is 1/max(delay).

3. The latency is Nmax(delay), where N is the number of stages in the pipeline.


Pipelining is Not That Easy for Pipelining is Not That Easy for ComputersComputers

• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.

– Data hazards: Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline

– Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC

• A possible solution is to “stall” the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Documents

Transcript of COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.