Motivazioni per la Memoria Virtuale - unibo.it


Page 1: Motivazioni per la Memoria Virtuale - unibo.it

1

Computer Architectures

DLX ISA: Pipelined Implementation

Page 2: Motivazioni per la Memoria Virtuale - unibo.it

2

The Pipelining Principle

Pipelining is today the main technique used to speed up a CPU.

The key idea is general and is applied in many industrial settings (production lines, oil pipelines, …).

A system S has to execute a task A N times:

A1, A2, A3, …, AN → S → R1, R2, R3, …, RN

Latency: the time elapsed between the beginning and the end of task A (TA).

Throughput: the rate at which tasks are completed (tasks per unit of time).

Page 3: Motivazioni per la Memoria Virtuale - unibo.it

3

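The Pipelining Principle

1) Sequential System

[Figure: timeline t with the tasks A1, A2, A3, …, AN executed one after another, each taking TA.]

Latency(1) (execution time of a single instruction) = TA

Throughput(1) = 1 / TA

2) Pipelined System

[Figure: system S executes task A as four phases P1, P2, P3, P4 over time t, each phase carried out by one of the stages S1, S2, S3, S4.]

Si: pipeline stage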

Page 4: Motivazioni per la Memoria Virtuale - unibo.it

4

The Pipelining Principle

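[Figure: timing diagram for system S with stages S1, S2, S3, S4. The tasks A1, A2, A3, A4, …, An each go through the phases P1, P2, P3, P4; a new task enters the pipeline every TP, so the phases of consecutive tasks overlap in time.]

TP: pipeline cycle

Latency(2) = 4 * TP = TA

Throughput(2) = 1 / TP = 4 / TA = 4 * Throughput(1)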

Page 5: Motivazioni per la Memoria Virtuale - unibo.it

The Pipelining Principle (2)

• Pipelining does not decrease the time needed to carry out each single task:

Latency(2) = Latency(1)

• Pipelining instead increases the throughput, multiplying it by a factor K equal to the number of stages of the pipeline:

Throughput(2) = K * Throughput(1)

• This reduces, by the same factor K, the total execution time of a sequence of N tasks (TN):

\[ T_N = \frac{N}{\mathit{Throughput}}, \qquad T_N(1) = \frac{N}{\mathit{Throughput}(1)}, \qquad T_N(2) = \frac{N}{\mathit{Throughput}(2)} \]

\[ \mathit{Speedup}(2\ \text{vs}\ 1) = \frac{T_N(1)}{T_N(2)} = \frac{\mathit{Throughput}(2)}{\mathit{Throughput}(1)} = K \]

Page 6: Motivazioni per la Memoria Virtuale - unibo.it

The Pipelining Principle (2)

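• Ideal case: perfectly balanced pipeline

\[ T_P = T_{Pi} = \frac{T_A}{K} \quad\Rightarrow\quad \mathit{Speedup} = K \]

• Real case: (slightly) unbalanced pipeline

\[ T_P = \max(T_{P1}, T_{P2}, \ldots, T_{PK}) \quad\Rightarrow\quad \mathit{Speedup} < K \]

• Example:

  • TA = 20t (t: time unit)
  • TP1 = 5t, TP2 = 5t, TP3 = 6t, TP4 = 4t, hence TP = 6t

\[ \mathit{Speedup}(2\ \text{vs}\ 1) = \frac{T_A}{T_P} = \frac{20t}{6t} \approx 3.33 \quad (< 4) \]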
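The balanced and unbalanced cases can be checked numerically. The following is a minimal sketch, assuming that the sequential latency TA is the sum of the stage delays (as in the example, 5t + 5t + 6t + 4t = 20t); the function names are illustrative and do not come from the slides.

```python
def pipeline_cycle(stage_delays):
    """Pipeline cycle TP: the delay of the slowest stage."""
    return max(stage_delays)


def speedup(stage_delays):
    """Speedup of the pipelined system over the sequential one, TA / TP."""
    t_a = sum(stage_delays)  # assumed: TA is the sum of the stage delays
    return t_a / pipeline_cycle(stage_delays)


balanced = [5, 5, 5, 5]    # TA = 20t, perfectly balanced: TP = TA / K
unbalanced = [5, 5, 6, 4]  # the example above: TP = 6t

print(speedup(balanced))              # 4.0
print(round(speedup(unbalanced), 2))  # 3.33
```

The slowest stage alone sets the pipeline cycle, so shortening any other stage does not improve the speedup.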

Page 7: Motivazioni per la Memoria Virtuale - unibo.it

7

Pipelining in a CPU (DLX)

Tasks: A1, A2, A3, …, AN → Instructions: I1, I2, I3, …, IN

[Figure: an instruction I flows over time t through the five stages IF, ID, EX, MEM, WB. The CPU datapath is drawn as combinatorial circuits (IF, ID, EX, MEM, WB) separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB (FFDs).]

CPI = 1 (ideally!)

Pipeline cycle = clock cycle = delay of the slowest stage

N.B. This architecture is COMPLETELY different from the sequential one.

Page 8: Motivazioni per la Memoria Virtuale - unibo.it

8

Pipeline in the DLX

Instr i      IF  ID  EX  MEM WB
Instr i+1        IF  ID  EX  MEM WB
Instr i+2            IF  ID  EX  MEM WB
Instr i+3                IF  ID  EX  MEM WB
Instr i+4                    IF  ID  EX  MEM WB

CPI (ideally) = 1

Clock cycle: Tclk = Td + TP + Tsu

Td: delay of the input stage register
TP: delay of the slowest combinatorial stage
Tsu: set-up time of the output stage register

Td and Tsu are the overhead introduced by the pipeline registers.
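A small numerical sketch of the clock-cycle formula follows. All delay values are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
def clock_period(t_d, stage_delays, t_su):
    """Tclk = Td + TP + Tsu, with TP the delay of the slowest combinatorial stage."""
    return t_d + max(stage_delays) + t_su


# Hypothetical delays in picoseconds for IF, ID, EX, MEM, WB (assumed values).
stages_ps = [1000, 800, 1200, 1100, 700]
t_clk = clock_period(t_d=100, stage_delays=stages_ps, t_su=100)
overhead = (100 + 100) / t_clk  # fraction of the cycle spent in the registers

print(t_clk)               # 1400
print(round(overhead, 3))  # 0.143
```

With these assumed numbers about 14% of every clock cycle is register overhead, which is one reason why adding stages does not scale the speedup indefinitely.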

Page 9: Motivazioni per la Memoria Virtuale - unibo.it

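[Figure: two D flip-flop registers with a combinatorial circuit of delay Tp between them. The clock period covers the delay of the input stage register, the delay of the slowest combinatorial stage and the set-up time of the output stage register.]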

Page 10: Motivazioni per la Memoria Virtuale - unibo.it

10

Requirements for implementation of the pipeline

Each stage has to be active during each clock cycle.

The PC has to be incremented in the IF stage (instead of ID). An ADDER has to be introduced in the IF stage (PC <- PC + 4, or equivalently PC <- PC + 1 on a word counter): since instructions are aligned, a 30-bit register (counter) is incremented each clock cycle (the 2 least-significant bits are always 0).

Two MDRs are required (referred to as LMDR and SMDR) to handle the case where a LOAD is immediately followed by a STORE: the WB stage of the LOAD overlaps the MEM stage of the STORE, so two data items are waiting to be written at the same time (one to memory, the other to the RF).

At every clock cycle it must be possible to perform 2 memory accesses (IF, MEM), hence a separate Instruction Memory (IM) and Data Memory (DM): “Harvard” architecture.

The CPU clock is determined by the slowest stage: IM and DM have to be (on-chip) cache memories.

Pipeline registers store both data and control information (the Control Unit is “distributed” among the pipeline stages).

Page 11: Motivazioni per la Memoria Virtuale - unibo.it

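DLX Pipelined Datapath

[Figure: DLX pipelined datapath across the stages IF, ID, EX, MEM, WB, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, ADD (+4), INSTR MEM, DEC, RF (ports RS1, RS2, RD, D), SE, ALU with its input MUXes, two “=0?” tests, DATA MEM and the MUXes on the PC input and on the write-back path.]

Annotations in the figure:

• SE: sign extension, for operations with immediates.
• RF inputs RD and D: number of the destination register and data; the destination register number is used in case of LOAD and ALU instructions.
• PC carried along the pipeline registers: for JL and JLR (PC stored in R31) and for computing the new PC when branching.
• PC: actually a programmable counter, since the two least-significant bits are always 0; it is loaded through a MUX if jumping.
• “=0?” for SCn (also <0 and >0) [acts on the output].
• “=0?” for Branch.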

Page 12: Motivazioni per la Memoria Virtuale - unibo.it

12

ID stage

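[Figure: detail of the ID stage, between the IF/ID and ID/EX pipeline registers. IF/ID holds IR and PC; the stage contains RF (ports RS1, RS2, RD, D), SE (sign extension) and DEC; ID/EX receives A, B, PC and the information travelling with the instruction.]

Labels in the figure:

• IR31-26 (Opcode) and IR10-00 (R instr.): inputs of DEC.
• IR25-21 and IR20-16: register numbers for RS1 and RS2.
• Number of the dest. register (from WB stage): RD input of the RF; Data (from WB stage): D input of the RF.
• IR15-0 (Offset/Immediate/JR/Branch/Load, dest. reg.) and IR25-16 (J; JL): inputs of SE.
• IR15 and IR25: sign bits replicated on bits (31-16) for Immed./Branch and on bits (31-26) for Jump.
• PC31-0 (JAL, JALR).
• Bus widths: 6, 16, 26 (J and JL), 32.
• LB, SW.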

Page 13: Motivazioni per la Memoria Virtuale - unibo.it

DLX Pipelined Datapath

[Figure: DLX pipelined datapath with the pipeline registers named. Stages IF, ID, EX, MEM, WB; blocks PC, ADD (+4), IM, DEC, RF, SE, ALU with its input MUXes, DM (Address, Data) and two “=0?” tests: one for SCn (also <0 and >0) [acts on the output], one for Branch. The number of the destination register is fed back to the RF; for JL and JLR the PC is written to R31.]

Contents of the pipeline registers:

IF/ID:   IR1, PC1
ID/EX:   IR2, PC2, A, B
EX/MEM:  IR3, PC3, X, COND, SMDR
MEM/WB:  IR4, PC4, Y, LMDR

SMDR: Store Memory Data Register
LMDR: Load Memory Data Register
IRi: Instruction Register i
X: ALU output, or DMAR, or Branch Target Address
Y: data computed in the previous stages

Page 14: Motivazioni per la Memoria Virtuale - unibo.it

14

Pipelined execution of an “ALU” instruction

X: “ALUOUTPUT” (in EX/MEM), Y: “ALUOUTPUT1”

IF    IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID    A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- Instruction decode
EX    X <- A op B   or   X <- A op ((IR2_15)^16 ## IR2_15..0); [PC3 <- PC2]; [IR3 <- IR2]
MEM   Y <- X (temporary storage, waiting for WB); [PC4 <- PC3]; [IR4 <- IR3]
WB    RD <- Y

Here (IR2_15)^16 is bit 15 of IR2 replicated 16 times and ## is concatenation, i.e. the 16-bit immediate is sign-extended to 32 bits.

The decoded opcode is carried along all stages.

NOTE: for these instructions, the destination register number (RS2/RD field) needs to be carried along the pipeline up to the WB stage.

NOTE: the IRi bits that are no longer needed by the following stages are dropped along the way; from one stage to the next, only the bits that are still needed are kept.
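The immediate operand used in the EX stage is the sign extension of the low 16 bits of the instruction (bit 15 replicated 16 times, concatenated with bits 15..0). A minimal sketch of that operation on 32-bit words follows; the helper names are illustrative and not part of the DLX definition.

```python
def sign_extend(value, bits):
    """Extend a `bits`-wide field to 32 bits by replicating its top bit."""
    value &= (1 << bits) - 1
    if (value >> (bits - 1)) & 1:
        value |= (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return value


def to_signed32(word):
    """Interpret a 32-bit word as a two's complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word


print(hex(sign_extend(0xFFFC, 16)))          # 0xfffffffc
print(to_signed32(sign_extend(0xFFFC, 16)))  # -4
print(to_signed32(sign_extend(0x0064, 16)))  # 100
```

The same helper with `bits=26` covers the 26-bit offset of the J and JL instructions.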

Page 15: Motivazioni per la Memoria Virtuale - unibo.it

15

Pipelined execution of a “MEM” instruction

X: “DMAR” (Data Memory Address Register)

IF    IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID    A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- Instruction decode
EX    X <- A op ((IR2_15)^16 ## IR2_15..0); SMDR <- B; [PC3 <- PC2]; [IR3 <- IR2]
MEM   LMDR <- M[X] (LOAD)   or   M[X] <- SMDR (STORE); [PC4 <- PC3]; [IR4 <- IR3]
WB    RD <- LMDR (LOAD) [sign ext.]

The decoded opcode is carried along all stages.

Page 16: Motivazioni per la Memoria Virtuale - unibo.it

16

Pipelined execution of a “BRANCH” instruction (normally after a SCn instruction)

X: “BTA” (Branch Target Address)

IF    IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID    A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- Instruction decode
EX    X <- PC2 op ((IR2_15)^16 ## IR2_15..0); Cond <- A op 0; [PC3 <- PC2]; [IR3 <- IR2]
MEM   if (Cond) PC <- X; [PC4 <- PC3]; [IR4 <- IR3]
WB    (NOP)

The decoded opcode is carried along all stages.

EX: the branch is evaluated on the current value of register A.

MEM: if the branch is taken, the PC is overwritten in this stage.

Page 17: Motivazioni per la Memoria Virtuale - unibo.it

17

Pipelined execution of a “JR” instruction

IF    IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID    A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- Instruction decode
EX    X <- A; [PC3 <- PC2]; [IR3 <- IR2]
MEM   PC <- X; [PC4 <- PC3]; [IR4 <- IR3]
WB    (NOP)

The decoded opcode is carried along all stages.

What would the stage sequence be for a J instruction?

Page 18: Motivazioni per la Memoria Virtuale - unibo.it

18

Pipelined execution of a “JL or JLR” instruction

IF    IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID    A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- Instruction decode
EX    PC3 <- PC2; X <- A (if JLR)   or   X <- PC2 + ((IR2_25)^6 ## IR2_25..0) (if JL); [IR3 <- IR2]
MEM   PC <- X; PC4 <- PC3; [IR4 <- IR3]
WB    R31 <- PC4

The decoded opcode is carried along all stages.

In this case the PCi values are used.

NOTE: writing R31 can NOT be done on the fly, since it could overlap with the register write of another instruction.

Page 19: Motivazioni per la Memoria Virtuale - unibo.it

19

What would the sequence be in the case of SCn (e.g. SLT R1,R2,R3)?

IF    IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID    A <- RS1; B <- RS2; PC2 <- PC1; IR2 <- IR1; ID/EX <- Instruction decode
EX    ?
MEM   ?
WB    ?

Page 20: Motivazioni per la Memoria Virtuale - unibo.it

20

Pipeline hazards

• Structural Hazards: the same resource is used by two different pipeline stages, so the instructions currently in those stages cannot be executed simultaneously.

• Data Hazards: they are due to dependencies between instructions. For example, an instruction needs to read a register that has not yet been written by a previous instruction (Read After Write, RAW).

• Control Hazards: the instructions that follow a branch depend on the branch outcome (taken/not taken).

A “hazard” occurs when, in a given clock cycle, an instruction currently in a pipeline stage cannot be executed in that clock cycle.

The instruction that cannot be executed has to be stopped (“pipeline stall” or “pipeline bubble”), together with all the following instructions, while the previous instructions proceed normally (so that the hazard is eventually removed).

Page 21: Motivazioni per la Memoria Virtuale - unibo.it

Hazards and stalls

[Figure: timing diagram over Clk 1 … Clk 12 for the instructions Ii-3, Ii-2, Ii-1, Ii, Ii+1. Ii-3, Ii-2 and Ii-1 flow normally through IF, ID, EX, MEM, WB; Ii and Ii+1 are held for three stall cycles (S) before proceeding.]

Stall: the clock signal for Ii, Ii+1, … is stopped for three cycles.

The consequence of a data hazard: if instruction Ii needs the result of instruction Ii-1 (data are read in the ID stage), it has to wait until after the WB of Ii-1.

Without stalls (ideal CPI = 1):

TN = N * 1 * CLK

With stalls (S = stalls per instruction, effective CPI = 1 + S):

TN = N * (1 + S) * CLK

In the example:

T5 = 8 * CLK = (5 + 3) * CLK = 5 * (1 + 3/5) * CLK
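The relation between stall cycles and execution time can be reproduced with a few lines. This is a minimal sketch; the function name and the default clock period of one time unit are assumptions made for illustration.

```python
def total_time(n_instructions, n_stall_cycles, clk=1):
    """Return (TN, effective CPI) for TN = N * (1 + S) * CLK."""
    s = n_stall_cycles / n_instructions  # S: stalls per instruction
    effective_cpi = 1 + s
    return n_instructions * effective_cpi * clk, effective_cpi


# The example of this slide: 5 instructions, 3 stall cycles.
t5, cpi = total_time(5, 3)
print(t5, cpi)
```

For the five instructions of the example this gives 8 clock cycles and an effective CPI of 1.6, matching T5 = 5 * (1 + 3/5) * CLK.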

Page 22: Motivazioni per la Memoria Virtuale - unibo.it

22

Forwarding

[Figure: timing diagram over Clk 1 … Clk 9 for the sequence below; each instruction enters IF one clock cycle after the previous one.]

ADD R3, R1, R4
SUB R7, R3, R5       hazard
OR  R1, R3, R5       hazard
LW  R6, 100 (R3)     hazard
AND R9, R5, R3       no hazard

For the LW, too, the data is not yet in the RF, since it is written on the positive clock edge at the end of WB (the register value is read in ID).

Forwarding allows eliminating almost all RAW hazards of the DLX pipeline without stalling the pipeline.

(NOTE: in the DLX, registers are modified only in WB)
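The hazard/no-hazard labels above follow from the distance between the producer and the consumer: a register written at the end of WB is not visible to the ID stage of the next three instructions. The sketch below detects such RAW pairs under simplifying assumptions that are not in the slides: the first register of every instruction is its destination (so stores and branches are not handled), and intermediate redefinitions of the register are ignored.

```python
import re


def parse(line):
    """Split 'OP Rd, ...' into (op, dest, sources); first register = destination."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"R\d+", rest)
    return op, regs[0], regs[1:]


def raw_hazards(program, window=3):
    """Return (producer index, consumer index, register) for each RAW hazard."""
    insts = [parse(line) for line in program]
    hazards = []
    for i, (_, dest, _) in enumerate(insts):
        for j in range(i + 1, min(i + 1 + window, len(insts))):
            if dest in insts[j][2]:
                hazards.append((i, j, dest))
    return hazards


program = [
    "ADD R3, R1, R4",
    "SUB R7, R3, R5",
    "OR R1, R3, R5",
    "LW R6, 100 (R3)",
    "AND R9, R5, R3",
]
print(raw_hazards(program))  # [(0, 1, 'R3'), (0, 2, 'R3'), (0, 3, 'R3')]
```

The AND is four instructions after the ADD, so it reads R3 in ID after the ADD has completed WB and is not reported.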

Page 23: Motivazioni per la Memoria Virtuale - unibo.it

23

Forwarding implementation

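[Figure: forwarding implementation. A Forwarding Unit (FU) controls the MUXes on the ALU inputs (A, B and Offset from ID/EX), which can also select data fed back from the EX/MEM and MEM/WB registers. A further MUX at the RF output feeds the ID/EX register.]

• FU inputs: RS1/RS2 and OPCODE of the instruction in ID/EX; RD1 (destination register)/OpCode and RD2/OpCode of the instructions further down the pipeline.
• The FU compares RS1, RS2 with RD1, RD2, taking the opcodes into account.
• MUX at the RF output: it allows “anticipating” the register value on ID/EX. MUX control: IF/ID opcode and comparison of RD with RS1 and RS2 (IF/ID). This is often performed inside the RF; alternatively, SPLIT-CYCLE (write before read, see next).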

Page 24: Motivazioni per la Memoria Virtuale - unibo.it

24

Forwarding Unit

• Within the Forwarding Unit, the opcodes of the instructions in the EX, MEM and WB stages are decoded.

• If the instruction in the EX stage needs a register value (A or B, i.e. it is an ALU instruction, NOT a J or Branch instruction), the opcodes of the instructions in the MEM and WB stages are examined. If they require a register update, the number of the register to be written is compared with the source register numbers of the instruction in the EX stage. If there is a match, the corresponding data is forwarded to the EX stage, replacing the data read from the register file.

• The bypass MUXes (at the inputs of the ID/EX barrier) are needed because an instruction in the ID stage may require the contents of a register whose number matches the destination of the instruction in the WB stage (if that instruction writes a register). In this case the data must be read from the MEM/WB barrier instead of from the register file.

• Alternatively, split-cycle: within the clock period T, the register is written in the first half-period and read in the second half-period.

[Figure: clock period T divided into two half-periods; the register is written in the first and read in the second.]
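The comparison performed by the Forwarding Unit for one ALU operand can be sketched as follows. This is an illustrative model, not the actual circuit: the dictionary layout and the function name are assumptions, the priority given to the EX/MEM value (the most recent one) is the usual convention, and R0 is treated as never forwarded because it is hardwired to zero in the DLX.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose the source of an ALU operand: 'EX/MEM', 'MEM/WB' or 'RF'.

    ex_mem / mem_wb describe the instructions currently in the MEM and WB
    stages: {'writes_reg': bool, 'rd': destination register number}.
    """
    if src_reg == 0:
        return "RF"  # R0 always reads as zero, nothing to forward
    if ex_mem["writes_reg"] and ex_mem["rd"] == src_reg:
        return "EX/MEM"  # most recent value wins
    if mem_wb["writes_reg"] and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "RF"


# SUB R7, R3, R5 in EX while ADD R3, R1, R4 is in MEM and an older
# instruction writing R1 is in WB.
ex_mem = {"writes_reg": True, "rd": 3}
mem_wb = {"writes_reg": True, "rd": 1}
print(forward_select(3, ex_mem, mem_wb))  # EX/MEM
print(forward_select(5, ex_mem, mem_wb))  # RF
```

The returned label corresponds to the MUX input that would be selected in front of the ALU.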

Page 25: Motivazioni per la Memoria Virtuale - unibo.it

25

Data hazard due to LOAD instructions

NOTE: the datum required by the ADD is available only at the end of the MEM stage of the LW. The hazard cannot be eliminated by forwarding, unless an additional MUX input connects the memory output directly to the ALU and everything is done within the same clock cycle; this would add the ALU delay to a memory access that is already slow by itself.

Without stalling:

LW  R1,32(R6)    IF  ID  EX  MEM WB
ADD R4,R1,R7         IF  ID  EX  MEM
SUB R5,R1,R8             IF  ID  EX
AND R6,R1,R7                 IF  ID

The pipeline needs to be stalled:

LW  R1,32(R6)    IF  ID  EX  MEM WB
ADD R4,R1,R7         IF  ID  S   EX  MEM
SUB R5,R1,R8             IF  S   ID  EX
AND R6,R1,R7                 S   IF  ID

In practice the clock signal is not generated for the stalled stages; the blocked clock cycle propagates along the pipeline one stage at a time.

From the end of the MEM stage of the LW onwards: standard forwarding MEM -> EX.

Page 26: Motivazioni per la Memoria Virtuale - unibo.it

26

Delayed load

In many RISC CPUs the hazard associated with the LOAD instruction is not handled in hardware by stalling the pipeline; it is handled in software by the compiler (delayed load):

LOAD instruction
delay slot
Next instruction

The compiler tries to fill the delay slot with a “useful” instruction (worst case: NOP).

LW R1,32(R6)

LW R3,10 (R4)

ADD R5,R1,R3

LW R6, 20 (R7)

LW R8, 40(R9)

LW R1,32(R6)

LW R3,10 (R4)

ADD R5,R1,R3

LW R6, 20 (R7)

LW R8, 40(R9)
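How a compiler might fill the load delay slot can be sketched on the sequence above. This is a simplified illustration, not the algorithm of any real compiler: it assumes straight-line code without stores or branches, takes the first register of each instruction as its destination, and hoists the first later instruction that has no register dependency on the instructions it jumps over; if none exists, it inserts a NOP.

```python
import re

NOP = "NOP"


def regs(instr):
    """Return (dest, sources) of 'OP Rd, ...'; (None, set()) for a NOP."""
    found = re.findall(r"R\d+", instr)
    return (found[0], set(found[1:])) if found else (None, set())


def independent(candidate, others):
    """True if `candidate` can be moved above every instruction in `others`."""
    c_dest, c_srcs = regs(candidate)
    for other in others:
        o_dest, o_srcs = regs(other)
        if o_dest in c_srcs or c_dest in o_srcs:
            return False
        if c_dest is not None and c_dest == o_dest:
            return False
    return True


def fill_load_delay_slots(program):
    prog = list(program)
    i = 0
    while i < len(prog) - 1:
        dest, _ = regs(prog[i])
        if prog[i].split()[0] == "LW" and dest in regs(prog[i + 1])[1]:
            # The next instruction uses the loaded register: fill the slot.
            for j in range(i + 2, len(prog)):
                uses_load = dest in regs(prog[j])[1]
                if not uses_load and independent(prog[j], prog[i + 1:j]):
                    prog.insert(i + 1, prog.pop(j))
                    break
            else:
                prog.insert(i + 1, NOP)
        i += 1
    return prog


program = ["LW R1,32(R6)", "LW R3,10(R4)", "ADD R5,R1,R3",
           "LW R6,20(R7)", "LW R8,40(R9)"]
for line in fill_load_delay_slots(program):
    print(line)
```

On this sequence the sketch moves `LW R6,20(R7)` between `LW R3,10(R4)` and the `ADD`, so the ADD no longer immediately follows the load it depends on.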

Page 27: Motivazioni per la Memoria Virtuale - unibo.it

27

Control Hazards

PC          BEQZ R4, 200
PC+4        SUB R7, R3, R5
PC+8        OR R1, R3, R5
PC+12       LW R6, 100 (R8)
…
PC+4+200    AND R9, R5, R3      (BTA)

Next instruction address:

R4 = 0: Branch Target Address (taken)
R4 ≠ 0: PC+4 (not taken)

[Figure: timing diagram over Clk 1 … Clk 8. BEQZ R4, 200 goes through IF, ID, EX, MEM, WB while SUB, OR and LW are fetched in the following three clock cycles. The new PC value is computed in EX (Aluout) and is in the PC one clock later; only then does the fetch with the new PC take place.]

Page 28: Motivazioni per la Memoria Virtuale - unibo.it

DLX Pipelined Datapath (Branch or JMP)

[Figure: DLX pipelined datapath (Instruction Fetch, Instruction Decode, Execute, Memory, Write Back) with PC, ADD (+4), IM, DEC, RF, SE, ALU with its input MUXes, the “=0?” tests, DM and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, highlighting the path followed by BEQZ R4, 200 up to the MUX that loads the new PC.]

When the new PC acts on the IM, three instructions have already travelled through the first three stages (EX included).

NOTE: if the feedback signal of the new PC were taken directly from the ALU output instead of from ALUOUT, the required stalls would be 2, but at the cost of a slower clock.

Page 29: Motivazioni per la Memoria Virtuale - unibo.it

29

Handling the Control Hazards

• Always Stall (three-clock block being propagated)

BEQZ R4,200       IF  ID  EX  MEM WB
Next instr.           S   S   S   IF        (fetch at new PC)
Real situation        IF  S   S   IF  ID

In the real situation the first IF takes place anyway, because the previous instruction has not been decoded yet; the IF is then repeated (PC <- PC - 4). The new value is sampled by the PC at the end of the MEM stage of the branch.

Hyp.: branch frequency = 25%  →  CPI = (1 + S) = (1 + 3 * 0.25) = 1.75

• Predict Not Taken

[Figure: timing diagram over Clk 1 … Clk 8 for BEQZ R4, 200 followed by SUB R7, R3, R5, OR R1, R3, R5 and LW R6, 100 (R8).]

The instructions following the branch (SUB, OR, LW) are fetched and start executing as if the branch were not taken. At branch completion, if the branch is taken they are flushed (they become NOPs) and the fetch restarts from the new PC; otherwise they simply continue.

Flushing is no problem, since none of these instructions has gone through WB!
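The CPI figure for the always-stall policy, and its counterpart for predict-not-taken, can be computed with a small helper. This is a sketch under stated assumptions: the 25% branch frequency and the 3-cycle penalty come from the slide, while the fraction of taken branches (60%) is a hypothetical value used only for illustration.

```python
def cpi_with_branches(branch_freq, penalty, taken_fraction=1.0):
    """Effective CPI = 1 + stalls per instruction caused by branches.

    taken_fraction = 1.0 models 'always stall' (every branch pays the
    penalty); a smaller value models 'predict not taken', where only
    taken branches pay it.
    """
    return 1 + branch_freq * taken_fraction * penalty


always_stall = cpi_with_branches(0.25, 3)
predict_not_taken = cpi_with_branches(0.25, 3, taken_fraction=0.6)  # assumed 60% taken

print(always_stall)                 # 1.75
print(round(predict_not_taken, 2))  # 1.45
```

The same helper with `penalty=1` gives the CPI when the branch is resolved in the ID stage.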

Page 30: Motivazioni per la Memoria Virtuale - unibo.it

Stalls with jumps (1/3)

[Figure: DLX pipelined datapath (IF, ID, EX, MEM, WB) with PC, ADD (+4), INSTR MEM, DEC, RF (RS1, RS2, RD, D), SE, ALU, “=0?” tests, DATA MEM and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. If jumping, the new PC is loaded through the PC MUX and three forced NOPs are injected into the pipeline.]

On the first positive clock edge after sampling the assertion of the jumping condition, 3 NOPs must be inserted to replace the 3 unwanted instructions already present in the pipeline.

Page 31: Motivazioni per la Memoria Virtuale - unibo.it

Stalls when jumping (2/3)

[Figure: same DLX pipelined datapath (IF, ID, EX, MEM, WB). If jumping, the new PC is loaded through the PC MUX and two forced NOPs are injected into the pipeline.]

On the first positive clock edge after sampling the assertion of the jumping condition, 2 NOPs must be inserted to replace the 2 unwanted instructions.

NOTE: in this case the jump condition and the new PC are sent to the MUX in the same clock cycle in which the condition is evaluated.

Page 32: Motivazioni per la Memoria Virtuale - unibo.it

Stalls when jumping (3/3)

[Figure: same DLX pipelined datapath (IF, ID, EX, MEM, WB). If jumping, the new PC is loaded through the PC MUX and a single NOP is injected into the pipeline.]

On the first positive clock edge after the assertion of the jumping condition, a NOP is inserted to replace the instruction currently in the IF/ID stage.

NOTE: in this case the jump condition and the new PC control the MUX at the same moment in which the condition is evaluated.

Page 33: Motivazioni per la Memoria Virtuale - unibo.it

33

Independent ALU for BRANCH/JMP

To reduce the number of stalls, an independent ALU (an additional full adder) computes the branch target in the ID stage:

IF    IR <- M[PC]; PC <- PC + 4; PC1 <- PC + 4
ID    A <- RS1; B <- RS2; PC2 <- PC1; ID/EX <- Decode; ID/EX <- Opc ext.
      BTA <- PC1 + ((IR1_15)^16 ## IR1_15..0)   (Branch)   or   BTA <- PC1 + ((IR1_25)^6 ## IR1_25..0)   (JMP)
      if Branch: if (RS1 op 0) PC <- BTA
      if JMP: always PC <- BTA
EX    -------------------------
MEM   -------------------------
WB    -------------------------

(New fetch: only one stall)

N.B. The full adder is separate from the “+4” adder, so it works in parallel with the addition that computes the address of the next instruction. Otherwise the same adder would have to be shared, with some multiplexers selecting whether to add 4 or the offset and whether to use PC or PC1.

NOTE: here there is only one “stall”, since the new value is loaded into the PC on the positive clock edge that ends the ID stage, whereas in the previous case it was loaded after the MEM stage, that is, two clocks later.

Page 34: Motivazioni per la Memoria Virtuale - unibo.it

BRANCH/JMP – 1 stall

[Figure: IF and ID stages of the datapath with the additional adder. Blocks: PC, ADDER (+4), IM, DEC, RF, SE, the concatenation (##) block, the “= 0 ?” test and two MUXes; pipeline registers IF/ID (IR1, PC1) and ID/EX (A, B, PC2).]

Annotations in the figure:

• “= 0 ?”: for Branches.
• ADDER (+4): standard addition.
• Branch adder: adds the displacement of the Branch instruction (offset with sign extension) to the PC of the Branch instruction.
• PC input of the Branch adder: this actually coincides with the current value in the PC (can be avoided).
• The new PC is selected according to the opcode and the value of the branch test register.

NOTE: for the “Unconditional Jump” instructions the situation is analogous: we only need to provide further inputs to the MUXes of the PC, taking either the RS1 register (JR and JLR) or the 26 least-significant bits of the IR with sign extension (J and JL), to be added to the current PC.

Page 35: Motivazioni per la Memoria Virtuale - unibo.it

35

Delayed branch

Similarly to the LOAD case, in several RISC CPUs the hazard associated with BRANCH instructions is handled in software by the compiler (delayed branch):

BRANCH instruction
delay slot
delay slot
delay slot
Next instruction

The compiler tries to fill the delay slots with “useful” instructions (worst case: NOP).

Page 36: Motivazioni per la Memoria Virtuale - unibo.it

36

Delayed branch/jump

Original:

Add R5, R4, R3
Sub R6, R5, R2
Or  R14, R6, R21
Sne R1, R8, R9      ; branch condition
Br  R1, +100

Compiled:

Sne R1, R8, R9      ; branch condition
Br  R1, +100
Add R5, R4, R3      ; delay slots: executed in both cases
Sub R6, R5, R2
Or  R14, R6, R21

Obviously, there must be no jumps in the group of instructions moved into the delay slots!

When no suitable instructions are available, the compiler inserts NOPs in place of one or more of the “postponed” instructions.

Page 37: Motivazioni per la Memoria Virtuale - unibo.it

37

Handling the Control Hazards

Dynamic prediction: Branch Target Buffer -> no stall (almost)

[Figure: the Branch Target Buffer is a table whose entries hold TAGS, Predicted PC and a T/NT bit; the current PC is compared (=) with the tags.]

HIT: fetch with the predicted PC
MISS: fetch with PC + 4

Correct prediction: no stall
Wrong prediction: 1-3 stalls (correct fetch in ID or EX, see before)

N.B. Here the branch slot is selected during the IF clock cycle that loads IR1 into IF/ID.
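The lookup performed during IF can be modelled in a few lines. This is a minimal sketch, not the hardware organisation of any specific CPU: the direct-mapped layout, the table size of 16 entries and the class and method names are assumptions, and a hit whose T/NT bit says "not taken" is treated like a miss.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds (tag, predicted PC, taken bit)."""

    def __init__(self, n_entries=16):
        self.n_entries = n_entries
        self.entries = {}

    def _split(self, pc):
        word = pc >> 2  # instructions are word aligned
        return word % self.n_entries, word // self.n_entries

    def next_fetch(self, pc):
        """Address to fetch after the instruction at `pc`."""
        index, tag = self._split(pc)
        entry = self.entries.get(index)
        if entry is not None and entry[0] == tag and entry[2]:
            return entry[1]  # HIT: fetch with the predicted PC
        return pc + 4        # MISS: fetch with PC + 4

    def update(self, pc, target, taken):
        """Record the outcome of the branch at `pc` once it is resolved."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target, taken)


btb = BranchTargetBuffer()
print(hex(btb.next_fetch(0x100)))  # 0x104 (miss)
btb.update(0x100, 0x1CC, taken=True)
print(hex(btb.next_fetch(0x100)))  # 0x1cc (hit)
```

The tag comparison prevents a different instruction that maps to the same entry from being redirected to the stored target.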

Page 38: Motivazioni per la Memoria Virtuale - unibo.it

38

Prediction Buffer: the simplest implementation uses a single bit that records what happened the last time the branch was executed.

When one outcome is predominant, each occurrence of the opposite outcome causes two mispredictions in a row.

[Figure: two nested loops, Loop1 (outer) and Loop2 (inner).]

When exiting Loop2 the prediction fails (the branch is predicted taken but is actually not taken); it fails again on re-entering Loop2, when the branch is predicted not taken but is actually taken.
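The double misprediction can be reproduced by simulating a single-bit predictor on the branch that closes the inner loop. This is an illustrative sketch: the loop length (4 iterations of Loop2, repeated 5 times by Loop1) and the initial prediction are assumed values.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """Count mispredictions of a predictor that repeats the last outcome."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember only what happened last time
    return misses


# Loop-closing branch of the inner loop: taken 3 times, then not taken on exit.
inner_loop = [True, True, True, False]
outcomes = inner_loop * 5  # the outer loop runs the inner loop 5 times

print(one_bit_mispredictions(outcomes))  # 9
```

After the first pass, every execution of the inner loop costs two mispredictions: one on exit and one on re-entry.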

Page 39: Motivazioni per la Memoria Virtuale - unibo.it

39

Hence, usually two bits are used for branch prediction:

[Figure: state diagram of the two-bit predictor; the states are connected by transitions labelled TAKEN and UNTAKEN.]
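A common way to realise a two-bit predictor is a saturating counter, sketched below. This is an assumption: the state diagram of the slide may use a different transition scheme. States 0 and 1 predict not taken, states 2 and 3 predict taken; the loop pattern (4 iterations, repeated 5 times) is a hypothetical example.

```python
def two_bit_mispredictions(outcomes, state=3):
    """Count mispredictions of a 2-bit saturating counter (0..3)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


inner_loop = [True, True, True, False]
outcomes = inner_loop * 5

print(two_bit_mispredictions(outcomes))  # 5
```

With this scheme a single not-taken outcome at loop exit does not flip the prediction, so only one misprediction per execution of the inner loop remains.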