Introduction Performance Analysis Single-Cycle Processor...

33
1 Copyright © 2007 Elsevier 7-<1> Chapter 7 :: Microarchitecture Digital Design and Computer Architecture David Money Harris and Sarah L. Harris Copyright © 2007 Elsevier 7-<2> Chapter 7 :: Topics Introduction Performance Analysis Single-Cycle Processor Multicycle Processor Pipelined Processor Exceptions Advanced Microarchitecture Copyright © 2007 Elsevier 7-<3> Introduction Microarchitecture: how to implement an architecture in hardware Processor: Datapath: functional blocks Control: control signals Physics Devices Analog Circuits Digital Circuits Logic Micro- architecture Architecture Operating Systems Application Software electrons transistors diodes amplifiers filters AND gates NOT gates adders memories datapaths controllers instructions registers device drivers programs Copyright © 2007 Elsevier 7-<4> Microarchitecture Multiple implementations for a single architecture: – Single-cycle Each instruction executes in a single cycle – Multicycle Each instruction is broken up into a series of shorter steps – Pipelined Each instruction is broken up into a series of steps Multiple instructions execute at once.

Transcript of Introduction Performance Analysis Single-Cycle Processor...

Page 1: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

1

Copyright © 2007 Elsevier 7-<1>

Chapter 7 :: Microarchitecture

Digital Design and Computer Architecture David Money Harris and Sarah L. Harris

Copyright © 2007 Elsevier 7-<2>

Chapter 7 :: Topics

• Introduction• Performance Analysis• Single-Cycle Processor• Multicycle Processor• Pipelined Processor• Exceptions• Advanced Microarchitecture

Copyright © 2007 Elsevier 7-<3>

Introduction

• Microarchitecture: how to implement an architecture in hardware

• Processor:– Datapath: functional blocks– Control: control signals

Physics

Devices

AnalogCircuits

DigitalCircuits

Logic

Micro-architecture

Architecture

OperatingSystems

ApplicationSoftware

electrons

transistorsdiodes

amplifiersfilters

AND gatesNOT gates

addersmemories

datapathscontrollers

instructionsregisters

device drivers

programs

Copyright © 2007 Elsevier 7-<4>

Microarchitecture

• Multiple implementations for a single architecture:– Single-cycle

• Each instruction executes in a single cycle– Multicycle

• Each instruction is broken up into a series of shorter steps– Pipelined

• Each instruction is broken up into a series of steps• Multiple instructions execute at once.

Page 2: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

2

Copyright © 2007 Elsevier 7-<5>

Processor Performance

• Program execution time

Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)

• Definitions:– Cycles/instruction = CPI– Seconds/cycle = clock period– 1/CPI = Instructions/cycle = IPC

• Challenge is to satisfy constraints of:– Cost– Power– Performance

Copyright © 2007 Elsevier 7-<6>

MIPS Processor

• We consider a subset of MIPS instructions:– R-type instructions: and, or, add, sub, slt– Memory instructions: lw, sw– Branch instructions: beq

• Later consider adding addi and j

Copyright © 2007 Elsevier 7-<7>

Architectural State

• Determines everything about a processor:– PC– 32 registers– Memory

Copyright © 2007 Elsevier 7-<8>

MIPS State Elements

CLK

A RDInstruction

Memory

A1

A3WD3

RD2RD1

WE3

A2

CLK

RegisterFile

A RDData

MemoryWD

WEPCPC'

CLK

32 3232 32

32

32

32 32

32

32

5

5

5

Page 3: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

3

Copyright © 2007 Elsevier 7-<9>

Single-Cycle MIPS Processor

• Datapath• Control

Copyright © 2007 Elsevier 7-<10>

Single-Cycle Datapath: lw fetch

• First consider executing lw• STEP 1: Fetch instruction

CLK

A RDInstruction

Memory

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

A RDData

MemoryWD

WEPCPC' Instr

CLK

Copyright © 2007 Elsevier 7-<11>

Single-Cycle Datapath: lw register read

• STEP 2: Read source operands from register file

Instr

CLK

A RDInstruction

Memory

A1

A3WD3

RD2

RD1WE3

A2

CLK

RegisterFile

A RDData

MemoryWD

WEPCPC'

25:21

CLK

Copyright © 2007 Elsevier 7-<12>

Single-Cycle Datapath: lw immediate

• STEP 3: Sign-extend the immediate

SignImm

CLK

A RDInstruction

Memory

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RDData

MemoryWD

WEPCPC' Instr 25:21

15:0

CLK

Page 4: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

4

Copyright © 2007 Elsevier 7-<13>

Single-Cycle Datapath: lw address

• STEP 4: Compute the memory address

SignImm

CLK

A RDInstruction

Memory

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RDData

MemoryWD

WEPCPC' Instr 25:21

15:0

SrcB

ALUResult

SrcA Zero

CLK

ALUControl2:0

ALU

010

Copyright © 2007 Elsevier 7-<14>

Single-Cycle Datapath: lw memory read

• STEP 5: Read data from memory and write it back to register file

A1

A3WD3

RD2

RD1WE3

A2

SignImm

CLK

A RDInstruction

Memory

CLK

Sign Extend

RegisterFile

A RDData

MemoryWD

WEPCPC' Instr 25:21

15:0

SrcB20:16

ALUResult ReadData

SrcA

RegWrite

Zero

CLK

ALUControl2:0

ALU

0101

Copyright © 2007 Elsevier 7-<15>

Single-Cycle Datapath: lw PC increment

• STEP 6: Determine the address of the next instruction

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RDData

MemoryWD

WEPCPC' Instr 25:21

15:0

SrcB20:16

ALUResult ReadData

SrcA

PCPlus4

Result

RegWrite

Zero

CLK

ALUControl2:0

ALU

0101

Copyright © 2007 Elsevier 7-<16>

Single-Cycle Datapath: sw

• Consider sw• Need to write data to memory

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

A RDData

MemoryWD

WEPCPC' Instr 25:21

20:16

15:0

SrcB20:16

ALUResult ReadData

WriteData

SrcA

PCPlus4

Result

MemWriteRegWrite

Zero

CLK

ALUControl2:0

ALU

10100

Page 5: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

5

Copyright © 2007 Elsevier 7-<17>

Single-Cycle Datapath: R-type instructions

• Consider R-type instructions: add, sub, and, or• Write ALUResult to register file• Writing to rd field of instruction (instead of rt)

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PCPC' Instr 25:21

20:16

15:0

SrcB

20:16

15:11

ALUResult ReadData

WriteData

SrcA

PCPlus4WriteReg4:0

Result

RegDst MemWrite MemtoRegALUSrcRegWrite

Zero

CLK

ALUControl2:0

ALU

0varies1 001

Copyright © 2007 Elsevier 7-<18>

Single-Cycle Datapath: beq

• Consider branch instruction: beq• Determine whether register values are equal• Calculate branch target address (BTA) from sign-extended

immediate and PC+4

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

RegDst Branch MemWrite MemtoRegALUSrcRegWrite

Zero

PCSrc

CLK

ALUControl2:0

ALU

01100 x0x 1

Copyright © 2007 Elsevier 7-<19>

Complete Single-Cycle Processor

SignImm

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

BranchMemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

Copyright © 2007 Elsevier 7-<20>

Control Unit

RegDst

BranchMemWriteMemtoReg

ALUSrcOpcode5:0

ControlUnit

ALUControl2:0Funct5:0

MainDecoder

ALUOp1:0

ALUDecoder

RegWrite

Page 6: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

6

Copyright © 2007 Elsevier 7-<21>

Review: ALU

ALU

N N

N3

A B

Y

F

A & B000A | B001A + B010not used011A & B100A | ~B101A - B110SLT111

FunctionF2:0

Copyright © 2007 Elsevier 7-<22>

Review: ALU

+

2 01

A B

Cout

Y3

01

F2

F1:0

[N-1] S

NN

N

N

N NNN

N

2

ZeroE

xtend

Copyright © 2007 Elsevier 7-<23>

Control Unit: ALU Decoder

Not Used11Look at Funct10Subtract01Add00MeaningALUOp1:0

111 (SLT)101010 (slt)1X001 (Or)100101 (or)1X000 (And)100100 (and)1X110 (Subtract)100010 (sub)1X010 (Add)100000 (add)1X110 (Subtract)XX1010 (Add)X00ALUControl2:0FunctALUOp1:0

Copyright © 2007 Elsevier 7-<24>

Control Unit: Main Decoder

000100beq

101011sw

100011lw

000000R-type

ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction

SignImm

CLK

A RDInstruction

Memory

+4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

BranchMemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

Page 7: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

7

Copyright © 2007 Elsevier 7-<25>

Control Unit: Main Decoder

01X010X0000100beq

00X101X0101011sw

00000101100011lw

10000011000000R-type

ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction

Copyright © 2007 Elsevier 7-<26>

Single-Cycle Datapath Example: or

SignImm

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

BranchMemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

0010

01

0

0

1

0

Copyright © 2007 Elsevier 7-<27>

Extended Functionality: addi

• No change to datapath

SignImm

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

BranchMemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

Copyright © 2007 Elsevier 7-<28>

Control Unit: addi

001000addi

01X010X0000100beq

00X101X0101011sw

00100101100011lw

10000011000000R-type

ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction

Page 8: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

8

Copyright © 2007 Elsevier 7-<29>

Control Unit: addi

00000101001000addi

01X010X0000100beq

00X101X0101011sw

00100101100011lw

10000011000000R-type

ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction

Copyright © 2007 Elsevier 7-<30>

Extended Functionality: j

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

BranchMemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

01

25:0 <<2

27:0 31:28

PCJump

Jump

Copyright © 2007 Elsevier 7-<31>

Control Unit: Main Decoder

000100j

0000

Jump

01X010X0000100beq

00X101X0101011sw

00100101100011lw

10000011000000R-type

ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction

Copyright © 2007 Elsevier 7-<32>

Control Unit: Main Decoder

1XXX0XXX0000100j

0000

Jump

01X010X0000100beq

00X101X0101011sw

00100101100011lw

10000011000000R-type

ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction

Page 9: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

9

Copyright © 2007 Elsevier 7-<33>

Single-Cycle Performance

• Clock cycle time is limited by the critical path (lw)

SignImm

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

BranchMemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU1

0100

1

0

1

0 0

Copyright © 2007 Elsevier 7-<34>

Single-Cycle Performance

• Single-cycle critical path:Tc = tpcq_PC + tmem + max(tRFread, tsext) + tmux + tALU + tmem + tmux + tRFsetup

• In most implementations, limiting paths are: – memory, ALU, register file. – Tc = tpcq_PC + 2tmem + tRFread + 2tmux + tALU + tRFsetup

Copyright © 2007 Elsevier 7-<35>

Single-Cycle Performance Example

Tc =

30tpcq_PCRegister clock-to-Q

Delay (ps)ParameterElement

20tsetupRegister setup

25tmuxMultiplexer

200tALUALU

250tmemMemory read

150tRFreadRegister file read

20tRFsetupRegister file setup

Copyright © 2007 Elsevier 7-<36>

Single-Cycle Performance Example

Tc = tpcq_PC + 2tmem + tRFread + 2tmux + tALU + tRFsetup

= [30 + 2(250) + 150 + 2(25) + 200 + 20] ps= 950 ps

30tpcq_PCRegister clock-to-Q

Delay (ps)ParameterElement

20tsetupRegister setup

25tmuxMultiplexer

200tALUALU

250tmemMemory read

150tRFreadRegister file read

20tRFsetupRegister file setup

Page 10: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

10

Copyright © 2007 Elsevier 7-<37>

Single-Cycle Performance Example

• For a program with 100 billion instructions executing on a single-cycle MIPS processor,

Execution Time =

Copyright © 2007 Elsevier 7-<38>

Single-Cycle Performance Example

• For a program with 100 billion instructions executing on a single-cycle MIPS processor,

Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)= (100 × 109)(1)(950 × 10-12 s)= 95 seconds

Copyright © 2007 Elsevier 7-<39>

Multicycle MIPS Processor

• The single-cycle microarchitecture performs all steps in one cycle+ simple- cycle time limited by longest instruction (lw)- two adders and two memories

• Multicycle microarchitecture uses a variable number of shorter steps (ALU or memory)+ higher clock speed+ simpler instructions run faster+ reuse expensive hardware on multiple cycles- sequencing overhead paid many times

• Same design steps: datapath & controlCopyright © 2007 Elsevier 7-<40>

Multicycle State Elements

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

RegisterFile

PCPC'

WD

WE

CLK

EN

• Replace Instruction and Data memories with a single unified memory– More realistic

Page 11: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

11

Copyright © 2007 Elsevier 7-<41>

Multicycle Datapath: instruction fetch

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

RegisterFile

PCPC' Instr

CLK

WD

WE

CLK

EN

IRWrite

• First consider executing lw• STEP 1: Fetch instruction

Copyright © 2007 Elsevier 7-<42>

Multicycle Datapath: lw register read

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

RegisterFile

PCPC' Instr 25:21

CLK

WD

WE

CLK CLK

A

EN

IRWrite

Copyright © 2007 Elsevier 7-<43>

Multicycle Datapath: lw immediate

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

PCPC' Instr 25:21

15:0

CLK

WD

WE

CLK CLK

A

EN

IRWrite

Copyright © 2007 Elsevier 7-<44>

Multicycle Datapath: lw address

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

PCPC' Instr 25:21

15:0

SrcB

ALUResult

SrcA

ALUOut

CLK

ALUControl2:0

ALU

WD

WE

CLK CLK

A CLK

EN

IRWrite

Page 12: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

12

Copyright © 2007 Elsevier 7-<45>

Multicycle Datapath: lw memory read

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

PCPC' Instr 25:21

15:0

SrcB

ALUResult

SrcA

ALUOut

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

Data

CLK

CLK

A CLK

EN

IRWriteIorD

01

Copyright © 2007 Elsevier 7-<46>

Multicycle Datapath: lw write register

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

PCPC' Instr 25:21

15:0

SrcB20:16

ALUResult

SrcA

ALUOut

RegWrite

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

Data

CLK

CLK

A CLK

EN

IRWriteIorD

01

Copyright © 2007 Elsevier 7-<47>

Multicycle Datapath: increment PC

PCWrite

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01PCPC' Instr 25:21

15:0

SrcB20:16

ALUResult

SrcA

ALUOut

ALUSrcARegWrite

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

Data

CLK

CLK

A

00011011

4

CLK

ENEN

ALUSrcB1:0IRWriteIorD

01

Copyright © 2007 Elsevier 7-<48>

Multicycle Datapath: sw

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01PC 0

1

PC' Instr 25:21

20:16

15:0

SrcB20:16

ALUResult

SrcA

ALUOut

MemWrite ALUSrcARegWrite

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

Data

CLK

CLK

A

00011011

4

CLK

ENEN

ALUSrcB1:0IRWriteIorDPCWrite

B

• Consider sw• Need to write data to memory

Page 13: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

13

Copyright © 2007 Elsevier 7-<49>

Multicycle Datapath: R-type Instructions

01

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01PC 0

1

PC' Instr 25:21

20:16

15:0

SrcB20:16

15:11

ALUResult

SrcA

ALUOut

RegDstMemWrite MemtoReg ALUSrcARegWrite

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWriteIorDPCWrite

• Consider R-type instructions: add, sub, and, or• Write ALUResult to register file• Writing to rd field of instruction (instead of rt)

Copyright © 2007 Elsevier 7-<50>

• Consider branch instruction: beq• Determine whether register values are equal• Calculate branch target address (BTA) from sign-extended

immediate and PC+4

Multicycle Datapath: beq

SignImm

b

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

RegDst BranchMemWrite MemtoReg ALUSrcARegWrite

Zero

PCSrc

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWriteIorD PCWrite

PCEn

Copyright © 2007 Elsevier 7-<51>

Complete Multicycle Processor

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

Copyright © 2007 Elsevier 7-<52>

Control Unit

ALUSrcA

PCSrc

Branch

ALUSrcB1:0

Opcode5:0

ControlUnit

ALUControl2:0Funct5:0

MainController

(FSM)

ALUOp1:0

ALUDecoder

RegWrite

PCWrite

IorD

MemWriteIRWrite

RegDstMemtoReg

RegisterEnables

MultiplexerSelects

Page 14: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

14

Copyright © 2007 Elsevier 7-<53>

Main Controller FSM: Fetch

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

0

1 1

0

X

X

00

01

0100

10

Reset

S0: Fetch

Copyright © 2007 Elsevier 7-<54>

Main Controller FSM: Fetch

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

0

1 1

0

X

X

00

01

0100

10

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

Reset

S0: Fetch

Copyright © 2007 Elsevier 7-<55>

Main Controller FSM: Decode

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

Reset

S0: Fetch S1: Decode

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

X

0 0

0

X

X

0X

XX

XXXX

00

Copyright © 2007 Elsevier 7-<56>

Main Controller FSM: Address Calculation

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

Reset

S0: Fetch

S2: MemAdr

S1: Decode

Op = LWor

Op = SW

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

X

0 0

0

X

X

01

10

010X

00

Page 15: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

15

Copyright © 2007 Elsevier 7-<57>

Main Controller FSM: Address Calculation

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

Reset

S0: Fetch

S2: MemAdr

S1: Decode

Op = LWor

Op = SW

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

X

0 0

0

X

X

01

10

010X

00

Copyright © 2007 Elsevier 7-<58>

Main Controller FSM: lw

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemRead

Op = LWor

Op = SW

Op = LW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

Copyright © 2007 Elsevier 7-<59>

Main Controller FSM: sw

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1 IorD = 1MemWrite

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

Op = LWor

Op = SW

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

Copyright © 2007 Elsevier 7-<60>

Main Controller FSM: R-Type

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 0RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

Op = LWor

Op = SWOp = R-type

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

Page 16: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

16

Copyright © 2007 Elsevier 7-<61>

Main Controller FSM: beq

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 0RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1

Branch

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

S8: Branch

Op = LWor

Op = SWOp = R-type

Op = BEQ

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

Copyright © 2007 Elsevier 7-<62>

Complete Multicycle Controller FSM

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 0RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1

Branch

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

S8: Branch

Op = LWor

Op = SWOp = R-type

Op = BEQ

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

Copyright © 2007 Elsevier 7-<63>

Main Controller FSM: addi

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 0RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1

Branch

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

S8: Branch

Op = LWor

Op = SWOp = R-type

Op = BEQ

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

Op = ADDI

S9: ADDIExecute

S10: ADDIWriteback

Copyright © 2007 Elsevier 7-<64>

Main Controller FSM: addi

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 0

IRWritePCWrite

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 0RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1

Branch

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

S8: Branch

Op = LWor

Op = SWOp = R-type

Op = BEQ

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

RegDst = 0MemtoReg = 0

RegWrite

Op = ADDI

S9: ADDIExecute

S10: ADDIWriteback

Page 17: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

17

Copyright © 2007 Elsevier 7-<65>

Extended Functionality: j

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01PC 0

1

PC' Instr 25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

RegDst BranchMemWrite MemtoReg ALUSrcARegWrite

Zero

PCSrc1:0

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWriteIorD PCWrite

PCEn

000110

<<2

25:0 (jump)

31:28

27:0

PCJump

Copyright © 2007 Elsevier 7-<66>

Control FSM: j

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 00

IRWritePCWrite

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 0RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 01

Branch

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

S8: Branch

Op = LWor

Op = SWOp = R-type

Op = BEQ

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

RegDst = 0MemtoReg = 0

RegWrite

Op = ADDI

S9: ADDIExecute

S10: ADDIWriteback

Op = J

S11: Jump

Copyright © 2007 Elsevier 7-<67>

Control FSM: j

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 00

IRWritePCWrite

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 0RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 01

Branch

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

S8: Branch

Op = LWor

Op = SWOp = R-type

Op = BEQ

Op = LWOp = SW

RegDst = 0MemtoReg = 1

RegWrite

S4: MemWriteback

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

RegDst = 0MemtoReg = 0

RegWrite

Op = ADDI

S9: ADDIExecute

S10: ADDIWriteback

PCSrc = 10PCWrite

Op = J

S11: Jump

Copyright © 2007 Elsevier 7-<68>

Multicycle Performance

• Instructions take different number of cycles:– 3 cycles: beq, j– 4 cycles: R-Type, sw, addi– 5 cycles: lw

• CPI is weighted average• SPECINT2000 benchmark:

– 25% loads– 10% stores – 11% branches– 2% jumps– 52% R-type

Average CPI = (0.11 + 0.2)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12

Page 18: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

18

Copyright © 2007 Elsevier 7-<69>

Multicycle Performance

• Multicycle critical path:Tc =

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

Copyright © 2007 Elsevier 7-<70>

Multicycle Performance

• Multicycle critical path:Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01 0

1

PC 01

PC' Instr 25:21

20:16

15:0

5:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

31:26

RegD

st

Branch

MemWrite

Mem

toReg

ALUSrcARegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

Copyright © 2007 Elsevier 7-<71>

Multicycle Performance Example

Tc =

30tpcqRegister clock-to-Q

Delay (ps)ParameterElement

20tsetupRegister setup

25tmuxMultiplexer

200tALUALU

250tmemMemory read

150tRFreadRegister file read

20tRFsetupRegister file setup

Copyright © 2007 Elsevier 7-<72>

Multicycle Performance Example

Tc = tpcq_PC + tmux + max(tALU + tmux, tmem) + tsetup

= tpcq_PC + tmux + tmem + tsetup

= [30 + 25 + 250 + 20] ps= 325 ps

30tpcq_PCRegister clock-to-Q

Delay (ps)ParameterElement

20tsetupRegister setup

25tmuxMultiplexer

200tALUALU

250tmemMemory read

150tRFreadRegister file read

20tRFsetupRegister file setup

Page 19: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

19

Copyright © 2007 Elsevier 7-<73>

Multicycle Performance Example

• For a program with 100 billion instructions executing on a multicycleMIPS processor– CPI = 4.12– Tc = 325 ps

Execution Time =

Copyright © 2007 Elsevier 7-<74>

Multicycle Performance Example

• For a program with 100 billion instructions executing on a multicycleMIPS processor– CPI = 4.12– Tc = 325 ps

Execution Time = (# instructions) × CPI × Tc

= (100 × 109)(4.12)(325 × 10-12)= 133.9 seconds

• This is slower than the single-cycle processor (95 seconds). Why?

Copyright © 2007 Elsevier 7-<75>

Multicycle Performance Example

• For a program with 100 billion instructions executing on a multicycleMIPS processor– CPI = 4.12– Tc = 325 ps

Execution Time = (# instructions) × CPI × Tc

= (100 × 109)(4.12)(325 × 10-12)= 133.9 seconds

• This is slower than the single-cycle processor (95 seconds). Why?– Not all steps the same length– Sequencing overhead for each step (tpcq + tsetup= 50 ps)

Copyright © 2007 Elsevier 7-<76>

Review: Single-Cycle MIPS Processor

SignImm

CLK

A RDInstruction

Memory

+4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01 PC' Instr 25:21

20:16

15:0

5:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

31:26

RegDst

BranchMemWriteMemtoReg

ALUSrc

RegWrite

OpFunct

ControlUnit

Zero

PCSrc

CLK

ALUControl2:0

ALU

01

25:0 <<2

27:0 31:28

PCJump

Jump

Page 20: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

20

Copyright © 2007 Elsevier 7-<77>

Review: Multicycle MIPS Processor

ImmExt

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01PC 0

1

PC' Instr 25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

ZeroCLK

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN000110

<<2

25:0 (Addr)

31:28

27:0

PCJump

5:0

31:26

Branch

MemWrite

ALUSrcARegWrite

OpFunct

ControlUnit

PCSrc

CLK

ALUControl2:0

ALUSrcB1:0IRWrite

IorD

PCWritePCEn

RegD

st

Mem

toReg

Copyright © 2007 Elsevier 7-<78>

Pipelined MIPS Processor

• Temporal parallelism• Divide single-cycle processor into 5 stages:

– Fetch– Decode– Execute– Memory– Writeback

• Add pipeline registers between stages

Copyright © 2007 Elsevier 7-<79>

Single-Cycle vs. Pipelined Performance

Time (ps)Instr

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg1

2

0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000

Instr

1

2

3

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

Single-Cycle

Pipelined

Copyright © 2007 Elsevier 7-<80>

Pipelining Abstraction

Time (cycles)

lw $s2, 40($0) RF 40

$0RF

$s2+ DM

RF $t2

$t1RF

$s3+ DM

RF $s5

$s1RF

$s4- DM

RF $t6

$t5RF

$s5& DM

RF 20

$s1RF

$s6+ DM

RF $t4

$t3RF

$s7| DM

add $s3, $t1, $t2

sub $s4, $s1, $s5

and $s5, $t5, $t6

sw $s6, 20($s1)

or $s7, $t3, $t4

1 2 3 4 5 6 7 8 9 10

add

IM

IM

IM

IM

IM

IM lw

sub

and

sw

or

Page 21: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

21

Copyright © 2007 Elsevier 7-<81>

Single-Cycle and Pipelined Datapath

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PCF01

PC' InstrD25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

ALU

WriteRegE4:0

CLKCLK

CLK

SignImm

CLK

A RDInstruction

Memory

+4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PC01

PC' Instr 25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

Zero

CLK

ALU

Fetch Decode Execute Memory WritebackCopyright © 2007 Elsevier 7-<82>

Corrected Pipelined Datapath

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PCF01

PC' InstrD 25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

WriteRegW4:0

ALU

WriteRegE4:0

CLKCLK

CLK

Fetch Decode Execute Memory Writeback

• WriteReg must arrive at the same time as Result

Copyright © 2007 Elsevier 7-<83>

Pipelined Control

SignImmE

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

01

01

A RDData

MemoryWD

WE01

PCF01

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

ZeroM

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

BranchE BranchM

RegDstE

ALUSrcE

WriteRegE4:0

Same control unit as single-cycle processor

Control delayed to proper pipeline stageCopyright © 2007 Elsevier 7-<84>

Pipeline Hazard

• Occurs when an instruction depends on results from previous instruction that hasn’t completed.

• Types of hazards:– Data hazard: register value not written back to register

file yet– Control hazard: next instruction not decided yet

(caused by branches)

Page 22: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

22

Copyright © 2007 Elsevier 7-<85>

Data Hazard

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM add

or

sub

Copyright © 2007 Elsevier 7-<86>

Handling Data Hazards

• Insert nops in code at compile time• Rearrange code at compile time• Forward data at run time• Stall the processor at run time

Copyright © 2007 Elsevier 7-<87>

Compile-Time Hazard Elimination

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM add

or

sub

nop

nop

RF RFDMnopIM

RF RFDMnopIM

9 10

• Insert enough nops for result to be ready• Or move independent useful instructions forward

Copyright © 2007 Elsevier 7-<88>

Data Forwarding

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM add

or

sub

Page 23: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

23

Copyright © 2007 Elsevier 7-<89>

Data Forwarding

SignImmE

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

01

01

A RDData

MemoryWD

WE

10

PCF01

PC' InstrD 25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Forw

ardA

E

Forw

ardB

E

20:16 RtE

RsD

RdD

RtD

Reg

Writ

eM

Reg

Writ

eW

Hazard Unit

PCPlus4E

BranchE BranchM

ZeroM

Copyright © 2007 Elsevier 7-<90>

Data Forwarding

• Forward to Execute stage from either:– Memory stage or– Writeback stage

• Forwarding logic for ForwardAE:if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01else ForwardAE = 00

• Forwarding logic for ForwardBE same, but replace rsEwith rtE

Copyright © 2007 Elsevier 7-<91>

Stalling

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM lw

or

sub

Trouble!

Copyright © 2007 Elsevier 7-<92>

Stalling

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM lw

or

sub

9

RF $s1

$s0

IM or

Stall

Page 24: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

24

Copyright © 2007 Elsevier 7-<93>

Stalling Hardware

SignImmE

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

01

01

A RDData

MemoryWD

WE

10

PCF01

PC' InstrD 25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Stal

lF

Stal

lD

Forw

ardA

E

Forw

ardB

E

20:16 RtE

RsD

RdD

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flus

hE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

Copyright © 2007 Elsevier 7-<94>

Stalling Hardware

• Stalling logic:lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE

StallF = StallD = FlushE = lwstall

Copyright © 2007 Elsevier 7-<95>

Control Hazards

• beq: – branch is not determined until the fourth stage of the pipeline– Instructions after the branch are fetched before branch occurs– These instructions must be flushed if the branch happens

• Branch misprediction penalty– number of instruction flushed when branch is taken– May be reduced by determining branch earlier

Copyright © 2007 Elsevier 7-<96>

Control Hazards: Original Pipeline

SignImmE

CLK

A RDInstruction

Memory

+4

A1

A3WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

01

01

A RDData

MemoryWD

WE

10

PCF01

PC' InstrD 25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

Stal

lF

Stal

lD

Forw

ardA

E

Forw

ardB

E

20:16 RtE

RsD

RdD

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flus

hE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

Page 25: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

25

Copyright © 2007 Elsevier 7-<97>

Control Hazards

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DM

RF $s0

$s4RF| DM

RF $s5

$s0RF- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM lw

or

sub

20

24

28

2C

30

...

...

9

Flushthese

instructions

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIM slt

Copyright © 2007 Elsevier 7-<98>

Control Hazards: Early Branch Resolution

EqualD

SignImmE

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

01

01

A RDData

MemoryWD

WE

10

PCF01

PC' InstrD 25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

=

SignImmD

Stal

lF

Stal

lD

Forw

ardA

E

Forw

ardB

E

20:16 RtE

RsD

RdE

RtD

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Hazard Unit

Flus

hE

EN

EN

CLR

CLR

Introduced another data hazard in Decode stage

Copyright © 2007 Elsevier 7-<99>

Control Hazards with Early Branch Resolution

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DMand $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

andIM

IM lw20

24

28

2C

30

...

...

9

Flushthis

instruction

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIM slt

Copyright © 2007 Elsevier 7-<100>

Handling Data and Control Hazards

EqualD

SignImmE

CLK

A RDInstruction

Memory

+

4

A1

A3WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

01

01

A RDData

MemoryWD

WE

10

PCF01

PC' InstrD 25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

01

01

=

SignImmD

Stal

lF

Stal

lD

Forw

ardA

E

Forw

ardB

E

Forw

ardA

D

Forw

ardB

D

20:16 RtE

RsD

RdD

RtD

Reg

Writ

eE

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Bran

chD

Hazard Unit

Flus

hE

EN

EN

CLR

CLR

Page 26: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

26

Copyright © 2007 Elsevier 7-<101>

Control Forwarding and Stalling Hardware

• Forwarding logic:ForwardAD = (rsD !=0) AND (rsD == WriteRegM) AND RegWriteM

ForwardBD = (rtD !=0) AND (rtD == WriteRegM) AND RegWriteM

• Stalling logic:branchstall = BranchD AND RegWriteE AND

(WriteRegE == rsD OR WriteRegE == rtD) OR BranchD AND MemtoRegM AND

(WriteRegM == rsD OR WriteRegM == rtD)

StallF = StallD = FlushE = lwstall OR branchstall

Copyright © 2007 Elsevier 7-<102>

Branch Prediction

• Guess whether branch will be taken– Backward branches are usually taken (loops)– Perhaps consider history of whether branch was

previously taken to improve the guess

• Good prediction reduces the fraction of branches requiring a flush

Copyright © 2007 Elsevier 7-<103>

Pipelined Performance Example

• Ideally CPI = 1• But need to handle stalling (caused by loads and branches)• SPECINT2000 benchmark:

– 25% loads– 10% stores – 11% branches– 2% jumps– 52% R-type

• Suppose:– 40% of loads used by next instruction– 25% of branches mispredicted

• What is the average CPI?

Copyright © 2007 Elsevier 7-<104>

Pipelined Performance Example

• SPECINT2000 benchmark: – 25% loads– 10% stores – 11% branches– 2% jumps– 52% R-type

• Suppose:– 40% of loads used by next instruction– 25% of branches mispredicted– All jumps flush next instruction

• What is the average CPI?– Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,– CPIlw = 1(0.6) + 2(0.4) = 1.4– CPIbeq = 1(0.75) + 2(0.25) = 1.25– Thus,

Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1)

= 1.15

Page 27: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

27

Copyright © 2007 Elsevier 7-<105>

Pipelined Performance

• Pipelined processor critical path:Tc = max {

tpcq + tmem + tsetup

2(tRFread + tmux + teq + tAND + tmux + tsetup )tpcq + tmux + tmux + tALU + tsetup

tpcq + tmemwrite + tsetup

2(tpcq + tmux + tRFwrite) }

Copyright © 2007 Elsevier 7-<106>

Pipelined Performance Example

100 pstRFwriteRegister file write220TmemwriteMemory write15tANDAND gate40teqEquality comparator

30tpcq_PCRegister clock-to-QDelay (ps)ParameterElement

20tsetupRegister setup25tmuxMultiplexer200tALUALU250tmemMemory read150tRFreadRegister file read20tRFsetupRegister file setup

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )= 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps

Copyright © 2007 Elsevier 7-<107>

Pipelined Performance Example

• For a program with 100 billion instructions executing on a pipelined MIPS processor,

• CPI = 1.15• Tc = 550 ps

Execution Time = (# instructions) × CPI × Tc

= (100 × 109)(1.15)(550 × 10-12)= 63 seconds

1.5163Pipelined0.71133Multicycle195Single-cycle

Speedup(single-cycle is baseline)

Execution Time(seconds)Processor

Copyright © 2007 Elsevier 7-<108>

Review: Exceptions

• Unscheduled procedure call to the exception handler• Casued by:

– Hardware, also called an interrupt, e.g. keyboard– Software, also called traps, e.g. undefined instruction

• When exception occurs, the processor:– Records the cause of the exception (Cause register)– Jumps to the exception handler at instruction address 0x80000180– Returns to program (EPC register)

Page 28: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

28

Copyright © 2007 Elsevier 7-<109>

Example Exception

Copyright © 2007 Elsevier 7-<110>

Exception Registers

• Not part of the register file.– Cause

• Records the cause of the exception• Coprocessor 0 register 13

– EPC (Exception PC)• Records the PC where the exception occurred• Coprocessor 0 register 14

• Move from Coprocessor 0– mfc0 $t0, Cause– Moves the contents of Cause into $t0

00000 $t0 (8) Cause (13) 00000000000

mfc0

31:26 25:21 20:16 15:11 10:0

010000

Copyright © 2007 Elsevier 7-<111>

Exception Causes

0x00000030Arithmetic Overflow

0x00000028Undefined Instruction

0x00000024Breakpoint / Divide by 0

0x00000020System Call

0x00000000Hardware Interrupt

CauseException

We extend the multicycle MIPS processor to handle the last two types of exceptions.

Copyright © 2007 Elsevier 7-<112>

Exception Hardware: EPC & Cause

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01PC 0

1

PC' Instr 25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

RegDst BranchMemWrite MemtoReg ALUSrcARegWrite

Zero

PCSrc1:0

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

01

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWriteIorD PCWrite

PCEn

<<2

25:0 (jump)

31:28

27:0

PCJump

00011011

0x8000 0180

Overflow

CLK

EN

EPCWrite

CLK

EN

CauseWrite

01

IntCause

0x30

0x28EPC

Cause

Page 29: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

29

Copyright © 2007 Elsevier 7-<113>

Exception Hardware: mfc0

SignImm

CLK

ARD

Instr / DataMemory

A1

A3

WD3

RD2RD1

WE3

A2

CLK

Sign Extend

RegisterFile

01

01PC 0

1

PC' Instr 25:21

20:16

15:0

SrcB20:16

15:11

<<2

ALUResult

SrcA

ALUOut

RegDst BranchMemWrite MemtoReg1:0 ALUSrcARegWrite

Zero

PCSrc1:0

CLK

ALUControl2:0

ALU

WD

WE

CLK

Adr

0001

Data

CLK

CLK

AB 00

011011

4

CLK

ENEN

ALUSrcB1:0IRWriteIorD PCWrite

PCEn

<<2

25:0 (jump)

31:28

27:0

PCJump

00011011

0x8000 0180

CLK

EN

EPCWrite

CLK

EN

CauseWrite

01

IntCause

0x30

0x28EPC

Cause

Overflow

...01101

01110

...15:11

10

C0

Copyright © 2007 Elsevier 7-<114>

Control FSM with Exceptions

IorD = 0AluSrcA = 0

ALUSrcB = 01ALUOp = 00PCSrc = 00

IRWritePCWrite

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

IorD = 1RegDst = 1

MemtoReg = 00RegWrite

IorD = 1MemWrite

ALUSrcA = 1ALUSrcB = 00ALUOp = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 01

Branch

Reset

S0: Fetch

S2: MemAdr

S1: Decode

S3: MemReadS5: MemWrite

S6: Execute

S7: ALUWriteback

S8: Branch

Op = LWor

Op = SWOp = R-type

Op = BEQ

Op = LWOp = SW

RegDst = 0MemtoReg = 01

RegWrite

S4: MemWriteback

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

RegDst = 0MemtoReg = 00

RegWrite

Op = ADDI

S9: ADDIExecute

S10: ADDIWriteback

PCSrc = 10PCWrite

Op = J

S11: Jump

Overflow OverflowS13:

OverflowPCSrc = 11

PCWriteIntCause = 0CauseWriteEPCWrite

Op = others

PCSrc = 11PCWrite

IntCause = 1CauseWriteEPCWrite

S12: Undefined

RegDst = 0Memtoreg = 10

RegWrite

Op = mfc0

S14: MFC0

Copyright © 2007 Elsevier 7-<115>

Advanced MicroArchitecture

• Deep Pipelining• Branch Prediction• Superscalar Processors• Out of Order Processors• Register Renaming• SIMD• Multithreading• Multiprocessors

Copyright © 2007 Elsevier 7-<116>

Deep Pipelining

• 10-20 stages typical• Number of stages limited by:

– Pipeline hazards– Sequencing overhead– Power– Cost

Page 30: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

30

Copyright © 2007 Elsevier 7-<117>

Branch Prediction

• Ideal pipelined processor: CPI = 1• Branch misprediction increases CPI• Static branch prediction:

– Check direction of branch (forward or backward)– If backward, predict taken– Otherwise, predict not taken

• Dynamic branch prediction:– Keep history of last (several hundred) branches in a

branch target buffer which holds:• Branch destination• Whether branch was taken

Copyright © 2007 Elsevier 7-<118>

Branch Prediction Example

add $s1, $0, $0 # sum = 0add $s0, $0, $0 # i = 0addi $t0, $0, 10 # $t0 = 10

for:beq $t0, $t0, done # if i == 10, branchadd $s1, $s1, $s0 # sum = sum + iaddi $s0, $s0, 1 # increment ij for

done:

Copyright © 2007 Elsevier 7-<119>

1-Bit Branch Predictor

• Remembers whether branch was taken the last time and does the same thing

• Mispredicts first and last branch of loop

Copyright © 2007 Elsevier 7-<120>

2-Bit Branch Predictor

• Only mispredicts last branch of loop

stronglytaken

predicttaken

weaklytaken

predicttaken

weaklynot taken

predictnot taken

stronglynot taken

predictnot taken

taken taken taken

takentakentaken

taken

taken

Page 31: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

31

Copyright © 2007 Elsevier 7-<121>

Superscalar

• Multiple copies of datapath execute multiple instructions at once

• Dependencies make it tricky to issue multiple instructions at once

CLK CLK CLK CLK

ARD A1

A2RD1A3

WD3WD6

A4A5A6

RD4

RD2RD5

InstructionMemory

RegisterFile Data

Memory

ALU

s

PC

CLK

A1A2

WD1WD2

RD1RD2

Copyright © 2007 Elsevier 7-<122>

Superscalar Example

lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 2or $t3, $s5, $s6sw $s7, 80($t3)

Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lw

add

lw $t0, 40($s0)

add $t1, $s1, $s2

sub $t2, $s1, $s3

and $t3, $s3, $s4

or $t4, $s1, $s5

sw $s5, 80($s0)

$t1$s2

$s1

+

RF$s3

$s1

RF

$t2-

DMIM

sub

and $t3$s4

$s3&

RF$s5

$s1

RF

$t4|

DMIM

or

sw80

$s0

+ $s5

Copyright © 2007 Elsevier 7-<123>

Superscalar Example with Dependencies

lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/5 = 1.17or $t3, $s5, $s6sw $s7, 80($t3)

Stall

Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $t0, $s2, $s3

and $t2, $s4, $t0

sw $s7, 80($t3)

RF$s1

$t0add

RF$s1

$t0

RF

$t1+

DM

RF$t0

$s4

RF

$t2&

DMIM

and

IMor

and

sub

|$s6

$s5$t3

RF80

$t3

RF+

DMsw

IM

$s7

9

$s3

$s2

$s3

$s2-

$t0

oror $t3, $s5, $s6

IM

Copyright © 2007 Elsevier 7-<124>

Out of Order Processor

• Looks ahead across multiple instructions to issue as many as possible at once

• Issues instructions out of order as long as no dependencies• Dependencies:

– RAW (read after write): one instruction writes, and later instruction reads a register

– WAR (write after read): one instruction reads, and a later instruction writes a register (also called an antidependence)

– WAW (write after write): one instruction writes, and a later instruction writes a register (also called an output dependence)

Page 32: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

32

Copyright © 2007 Elsevier 7-<125>

Out of Order Processor

• Instruction level parallelism: the number of instruction that can be issued simultaneously (in practice average < 3)

• Scoreboard: table that keeps track of:– Instructions waiting to issue– Available functional units– Dependencies

Copyright © 2007 Elsevier 7-<126>

Out of Order Processor Example

lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/4 = 1.5or $t3, $s5, $s6sw $s7, 80($t3)

Time (cycles)

1 2 3 4 5 6 7 8

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $t0, $s2, $s3

and $t2, $s4, $t0

sw $s7, 80($t3)

or|$s6

$s5$t3

RF80

$t3

RF+

DMsw $s7

or $t3, $s5, $s6

IM

RF$s1

$t0

RF

$t1+

DMIM

add

sub-$s3

$s2$t0

two cycle latencybetween load anduse of $t0

RAW

WAR

RAW

RF$t0

$s4

RF&

DMand

IM

$t2

RAW

Copyright © 2007 Elsevier 7-<127>

Register Renaming

Time (cycles)

1 2 3 4 5 6 7

RF40

$s0

RF

$t0+

DMIM

lwlw $t0, 40($s0)

add $t1, $t0, $s1

sub $r0, $s2, $s3

and $t2, $s4, $r0

sw $s7, 80($t3)

sub-$s3

$s2$r0

RF$r0

$s4

RF&

DMand

$s7

or $t3, $s5, $s6IM

RF$s1

$t0

RF

$t1+

DMIM

add

sw+80

$t3

RAW

$s6

$s5|

or

2-cycle RAW

RAW

$t2

$t3

lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/3 = 2or $t3, $s5, $s6sw $s7, 80($t3)

Copyright © 2007 Elsevier 7-<128>

SIMD

• Single Instruction Multiple Data (SIMD)– Single instruction acts on multiple pieces of data at once– Common application: graphics– Perform short arithmetic operations (also called packed

arithmetic)

• For example, add four 8-bit elements• Must modify ALU to eliminate carries between 8-

bit valuespadd8 $s2, $s0, $s1

a0

0781516232432 Bit position

$s0a1a2a3

b0 $s1b1b2b3

a0 + b0 $s2a1 + b1a2 + b2a3 + b3

+

Page 33: Introduction Performance Analysis Single-Cycle Processor …pages.hmc.edu/harris/class/e85/old/fall07/DDCA_Ch7.pdf · 2020. 3. 2. · Single-Cycle Datapath: R-type instructions •

33

Copyright © 2007 Elsevier 7-<129>

Multithreading: First Some Definitions

• Process: program running on a computer• Multiple processes can run at once: e.g., surfing Web,

playing music, writing a paper• Thread: part of a program• Each process has multiple threads: e.g., a word processor

may have threads for typing, spell checking, printing• In a conventional processor:

– One thread runs at once– When one thread stalls (for example, waiting for memory):

• Architectural state of that thread is stored• Architectural state of waiting thread is loaded into processor and it runs• Called context switching

– But appears to user like all threads running simultaneously

Copyright © 2007 Elsevier 7-<130>

Multithreading

• Multiple copies of architectural state• Multiple threads active at once:

– When one thread stalls, another runs immediately (no need to store or restore architectural state)

– If one thread can’t keep all execution units busy, another thread can use them

• Does not increase ILP of single thread, but does increase throughput

Copyright © 2007 Elsevier 7-<131>

Multiprocessors

• Multiple processors (cores) with a method of communication between them

• Types of multiprocessing:– Symmetric multiprocessing (SMT): multiple cores with a shared

memory– Asymmetric multiprocessing: separate cores for different tasks

(for example, DSP and CPU in cell phone)– Clusters: each core has its own memory system

Copyright © 2007 Elsevier 7-<132>

Other Resources

• Patterson & Hennessy’s: Computer Architecture: A Quantitative Approach

• Conferences:– www.cs.wisc.edu/~arch/www/

– ISCA (International Symposium on Computer Architecture)

– HPCA (International Symposium on High Performance Computer Architecture)