LOGO Computer Architecture Dr. Esam Al_Qaralleh Princess Sumaya University for Technology.

Page 1

Computer Architecture Dr. Esam Al_Qaralleh

Princess Sumaya University for Technology

Page 2

Data path

Page 3

The Processor: Data Path and Control

[Figure: abstract datapath with PC, instruction memory (address in, instruction out), register bank (three register-number inputs, data), ALU, and data memory (address, data)]

Two types of functional units:
• elements that operate on data values (combinational)
• elements that contain state (state elements)

Page 4

Single Cycle Implementation

[Figure: state element 1 feeds combinational logic, which feeds state element 2, within one clock cycle]

Typical execution:
• read contents of some state elements,
• send values through some combinational logic,
• write results to one or more state elements

Using a clock signal for synchronization: edge-triggered methodology

Page 5

A portion of the datapath used for fetching instructions

Page 6

The datapath for R-type instructions

Page 7

The datapath for load and store instructions

Page 8

The datapath for branch instructions

Page 9

Complete Data Path

Page 10

Control

• Selecting the operations to perform (ALU, read/write, etc.)
• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
  Example: lw $1, 100($2)
• Value of control signals is dependent upon:
  • what instruction is being executed
  • which step is being performed

Page 11

Data Path with Control

Page 12

Single Cycle Implementation

Calculate cycle time assuming negligible delays except:
• memory (2 ns)
• ALU and adders (2 ns)
• register file access (1 ns)
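The cycle-time calculation can be sketched as follows. The unit delays are the ones given on the slide; the list of functional units each instruction class passes through is an assumption based on the usual single-cycle MIPS datapath, and the names `PATHS`, `path_delay`, and `cycle_time` are illustrative only.

```python
# Unit delays from the slide, in nanoseconds.
MEM_NS, ALU_NS, REG_NS = 2, 2, 1

# Assumed critical path per instruction class: the units used, in order.
PATHS = {
    "R-type": [MEM_NS, REG_NS, ALU_NS, REG_NS],          # fetch, read regs, ALU, write reg
    "load":   [MEM_NS, REG_NS, ALU_NS, MEM_NS, REG_NS],  # fetch, read regs, address, data memory, write reg
    "store":  [MEM_NS, REG_NS, ALU_NS, MEM_NS],          # fetch, read regs, address, data memory
    "branch": [MEM_NS, REG_NS, ALU_NS],                  # fetch, read regs, compare
    "jump":   [MEM_NS],                                  # fetch only
}

def path_delay(instr_class):
    """Total delay of one instruction class, in ns."""
    return sum(PATHS[instr_class])

def cycle_time():
    """A single-cycle clock must be long enough for the slowest class."""
    return max(path_delay(name) for name in PATHS)

for name in PATHS:
    print(f"{name:7s} {path_delay(name)} ns")
print("cycle time:", cycle_time(), "ns")
```

Under these assumptions the load instruction sets the clock period, and every other instruction class wastes part of each cycle.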

Page 13

Multicycle Approach

Single cycle problems:
• what if we had a more complicated instruction like floating point?
• wasteful of area

One solution:
• use a “smaller” cycle time
• have different instructions take different numbers of cycles
• a “multicycle” datapath

Page 14

Multicycle Approach

Break up the instructions into steps, each step takes a cycle
• balance the amount of work to be done
• restrict each cycle to use only one major functional unit

At the end of a cycle
• store values for use in later cycles (easiest thing to do)
• introduce additional “internal” registers

Page 15

Multicycle Approach Implementation

Page 16

Five Execution Steps

• Instruction Fetch
• Instruction Decode and Register Fetch
• Execution, Memory Address Computation, or Branch Completion
• Memory Access or R-type instruction completion
• Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Page 17

Five Execution Steps

Actions per step, by instruction class (R-type, memory-reference, branches, jumps):

Instruction fetch (all classes):
    IR = Mem[PC]
    PC = PC + 4

Instruction decode / register fetch (all classes):
    A = Reg[IR[25-21]]
    B = Reg[IR[20-16]]
    ALUOut = PC + (sign-extend(IR[15-0]) << 2)

Execution, address computation, branch/jump completion:
    R-type:            ALUOut = A op B
    Memory-reference:  ALUOut = A + sign-extend(IR[15-0])
    Branches:          if (A == B) then PC = ALUOut
    Jumps:             PC = PC[31-28] || (IR[25-0] << 2)

Memory access or R-type completion:
    R-type:  Reg[IR[15-11]] = ALUOut
    Load:    MDR = Mem[ALUOut]
    Store:   Mem[ALUOut] = B

Memory read completion:
    Load:    Reg[IR[20-16]] = MDR
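The bit-field notation in the step table (IR[25-21], sign extension, and the `||` concatenation) can be sketched in a few lines. This is only an illustration: the helper names `bits`, `sign_extend16`, `branch_target`, and `jump_target` are hypothetical, the example word assumes the standard MIPS encoding of `lw` (opcode 35), and the per-class cycle counts are read off the table by counting the steps each class uses.

```python
def bits(word, hi, lo):
    """Return word[hi..lo] inclusive, matching the slide's IR[hi-lo] notation."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc_plus_4, ir):
    # ALUOut = PC + (sign-extend(IR[15-0]) << 2), using the already incremented PC
    return (pc_plus_4 + (sign_extend16(bits(ir, 15, 0)) << 2)) & 0xFFFFFFFF

def jump_target(pc_plus_4, ir):
    # PC = PC[31-28] || (IR[25-0] << 2)
    return (pc_plus_4 & 0xF0000000) | (bits(ir, 25, 0) << 2)

# Number of steps each class needs, counted from the table.
CYCLES = {"R-type": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

# lw $1, 100($2), assuming the standard encoding: opcode 35, rs = 2, rt = 1, offset = 100
ir = (35 << 26) | (2 << 21) | (1 << 16) | 100
print("rs =", bits(ir, 25, 21), "rt =", bits(ir, 20, 16),
      "offset =", sign_extend16(bits(ir, 15, 0)))
```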

Page 18

Pipelining

Page 19

Pipelining is Natural!

Pipelining provides a method for executing multiple instructions at the same time.

Laundry Example
• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Page 20

Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

[Figure: timeline from 6 PM to midnight; loads A, B, C, D (task order) run one after another, each taking 30 + 40 + 20 minutes]

Page 21

Pipelined Laundry: Start work ASAP

Pipelined laundry takes 3.5 hours for 4 loads

[Figure: timeline starting at 6 PM; loads A, B, C, D (task order) overlap, with segments of 30, 40, 40, 40, 40, 20 minutes along the time axis]
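The 6-hour and 3.5-hour figures can be checked with a small simulation. This is a sketch under the assumption that a finished load can wait between machines without blocking the machine it just left; the function names are illustrative.

```python
STAGES = [30, 40, 20]  # washer, dryer, folder, in minutes

def sequential_minutes(loads, stages=STAGES):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(loads, stages=STAGES):
    """Each load enters a stage as soon as both the load and the machine are free."""
    free_at = [0] * len(stages)      # time at which each machine becomes free
    finish = 0
    for _ in range(loads):
        ready = 0                    # time at which this load can enter its next stage
        for i, duration in enumerate(stages):
            start = max(ready, free_at[i])
            ready = start + duration
            free_at[i] = ready
        finish = ready
    return finish

print(sequential_minutes(4) / 60, "hours sequential")
print(pipelined_minutes(4) / 60, "hours pipelined")
```

The pipelined total is 30 + 4 x 40 + 20 minutes: once the pipeline is full, one load completes every 40 minutes, the time of the slowest stage.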

Page 22

Pipelining Lessons

Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload

Pipeline rate is limited by the slowest pipeline stage

Multiple tasks operate simultaneously using different resources

Potential speedup = number of pipe stages

Unbalanced lengths of pipe stages reduce speedup

Time to “fill” the pipeline and time to “drain” it reduce speedup

Stall for dependences

[Figure: pipelined laundry timeline from 6 PM to 9 PM; loads A, B, C, D (task order), with segments of 30, 40, 40, 40, 40, 20 minutes]

Page 23

Basic DataPath

• What do we need to add to actually split the datapath into stages?

Page 24

Pipeline DataPath

Page 25

The Five Stages of the Load Instruction

• Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr

Page 26

Pipelined Execution

On a processor, multiple instructions are in various stages at the same time.

Assume each instruction takes five cycles

Program flow runs top to bottom, time runs left to right:

IFetch Dcd    Exec   Mem    WB
       IFetch Dcd    Exec   Mem    WB
              IFetch Dcd    Exec   Mem    WB
                     IFetch Dcd    Exec   Mem    WB
                            IFetch Dcd    Exec   Mem    WB
                                   IFetch Dcd    Exec   Mem    WB
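A staggered diagram like this one can be generated for any number of instructions. The sketch below assumes an ideal pipeline with no stalls, where each instruction starts one cycle after the previous one; `pipeline_chart` and `total_cycles` are illustrative names.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]

def pipeline_chart(n_instructions, stages=STAGES):
    """One text row per instruction; row i is shifted right by i cycles."""
    width = max(len(s) for s in stages) + 1
    rows = []
    for i in range(n_instructions):
        cells = [""] * i + list(stages)
        rows.append("".join(cell.ljust(width) for cell in cells).rstrip())
    return rows

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish all instructions: fill the pipeline once, then one per cycle."""
    return n_instructions + n_stages - 1 if n_instructions else 0

for row in pipeline_chart(6):
    print(row)
print("total cycles:", total_cycles(6))
```

For the six instructions shown, the last one leaves the pipeline in cycle 6 + 5 - 1 = 10, compared with 30 cycles if each five-cycle instruction ran to completion before the next began.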

Page 27

Single Cycle, Multiple Cycle, vs. Pipeline

Page 28

Graphically Representing Pipelines

Can help with answering questions like:
• How many cycles does it take to execute this code?
• What is the ALU doing during cycle 4?
• Are two instructions trying to use the same resource at the same time?

Page 29

Why Pipeline? Because the resources are there!

Page 30

Pipelining MIPS Execution

Page 31

Observations

• Use separate instruction and data memories
  • To reduce memory conflicts
  • The memory system must deliver 5 times the bandwidth if the pipelined processor has a clock cycle equal to that of the unpipelined processor
• The register file is used in two stages
  • One for reading in ID and one for writing in WB
  • 2 reads and 1 write in every clock cycle
• Maintaining the PC
  • IF stage: prepares the PC for the next instruction
  • ID stage: computes the branch target address
  • Problem: a branch does not change the PC until the ID stage

Page 32

Why Pipeline?

Suppose 100 instructions are executed
• The single cycle machine has a cycle time of 45 ns
• The multicycle and pipeline machines have cycle times of 10 ns
• The multicycle machine has a CPI of 4.6

Single cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns

Multicycle machine: 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns

Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns

Ideal pipelined vs. single cycle speedup: 4500 ns / 1040 ns = 4.33

What has not yet been considered?
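The three execution times follow from one formula, time = cycle time x (CPI x instruction count + extra cycles). A minimal sketch using the slide's numbers; the function name `total_time_ns` is illustrative.

```python
def total_time_ns(cycle_ns, cpi, instructions, extra_cycles=0):
    """Execution time in ns; extra_cycles covers pipeline drain."""
    return cycle_ns * (cpi * instructions + extra_cycles)

single = total_time_ns(45, 1, 100)
multi = total_time_ns(10, 4.6, 100)
pipelined = total_time_ns(10, 1, 100, extra_cycles=4)

print(f"single cycle: {single:.0f} ns")
print(f"multicycle:   {multi:.0f} ns")
print(f"pipelined:    {pipelined:.0f} ns")
print(f"speedup over single cycle: {single / pipelined:.2f}")
```

Note that the 4-cycle drain matters less as the instruction count grows, so the speedup approaches 45 / 10 = 4.5 for long programs under these ideal assumptions.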

Page 33

Compare Performance

Compare: single-cycle, multicycle and pipelined control using SPECint2000

Single-cycle:
• memory access = 200 ps, ALU = 100 ps, register file read and write = 50 ps
• 200 + 50 + 100 + 200 + 50 = 600 ps

Multicycle:
• 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU
• CPI = 4.12, clock cycle = 200 ps (longest functional unit)

Pipelined:
• Loads: 1 clock cycle when there is no load-use dependence, 2 when there is; average 1.5 per load
• Stores and ALU instructions take 1 clock cycle
• Branches: 1 when predicted correctly and 2 when not; average 1.25
• Jumps: 2
• CPI = 1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17

Average instruction time: single-cycle = 600 ps, multicycle = 4.12 x 200 = 824 ps, pipelined = 1.17 x 200 = 234 ps

Memory access (200 ps) is the bottleneck. How to improve?
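Both CPI values are weighted averages over the instruction mix. The sketch below reproduces them; the pipelined cycle counts are the ones on the slide, while the multicycle counts per class (5 for loads, 4 for stores and ALU instructions, 3 for branches and jumps) are an assumption that happens to reproduce the stated CPI of 4.12.

```python
MIX = {"load": 0.25, "store": 0.10, "alu": 0.52, "branch": 0.11, "jump": 0.02}

# Assumed multicycle step counts per class (not stated on the slide).
MULTICYCLE = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}
# Average pipelined cycles per class, as given on the slide.
PIPELINED = {"load": 1.5, "store": 1, "alu": 1, "branch": 1.25, "jump": 2}

def average_cpi(mix, cycles):
    """Weighted average of cycles per instruction over the mix."""
    return sum(mix[name] * cycles[name] for name in mix)

CLOCK_PS = 200
for label, table in (("multicycle", MULTICYCLE), ("pipelined", PIPELINED)):
    cpi = average_cpi(MIX, table)
    print(f"{label}: CPI = {cpi:.4f}, average instruction time = {cpi * CLOCK_PS:.1f} ps")
```

The unrounded pipelined CPI is 1.1725, which gives 234.5 ps; the slide rounds the CPI to 1.17 first and so reports 234 ps.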

Page 34

Can pipelining get us into trouble?

Yes: Pipeline Hazards
• structural hazards: attempt to use the same resource two different ways at the same time
  • E.g., two instructions try to read the same memory at the same time
• data hazards: attempt to use item before it is ready
  • instruction depends on result of prior instruction still in the pipeline
      add r1, r2, r3
      sub r4, r2, r1
• control hazards: attempt to make a decision before condition is evaluated
  • branch instructions
      beq r1, loop
      add r1, r2, r3

Can always resolve hazards by waiting
• pipeline control must detect the hazard
• take action (or delay action) to resolve hazards
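Detecting the data hazard in the add/sub pair amounts to checking whether a later instruction reads the register an earlier one writes. A minimal sketch, with a hypothetical `Instr` record and `raw_hazard` helper that are not part of any real toolchain:

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None if nothing is written
    srcs: Tuple[str, ...]    # registers read

def raw_hazard(earlier: Instr, later: Instr) -> bool:
    """True if `later` reads the register that `earlier` writes (read after write)."""
    return earlier.dest is not None and earlier.dest in later.srcs

add = Instr("add", "r1", ("r2", "r3"))   # add r1, r2, r3
sub = Instr("sub", "r4", ("r2", "r1"))   # sub r4, r2, r1

print(raw_hazard(add, sub))  # sub needs r1 before add has written it back
```

A real pipeline makes this comparison in hardware on the register-number fields held in the pipeline registers, and only for instructions close enough together to overlap.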

Page 35

Single Memory is a Structural Hazard

[Figure: pipeline diagram, instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time in clock cycles; each instruction passes through Mem, Reg, ALU, Mem, Reg]

Detection is easy in this case! (right half highlight means read, left half write)

Page 36

Structural Hazards limit performance

Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then
• average CPI ≥ 1.3
• otherwise the resource is more than 100% utilized

Solution 1: Use separate instruction and data memories

Solution 2: Allow memory to read and write more than one word per cycle

Solution 3: Stall

Page 37

Control Hazard Solutions

• Stall: wait until decision is clear
  • It’s possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read
• Impact: 2 clock cycles per branch instruction => slow

[Figure: pipeline diagram, instruction order (Add, Beq, Load) versus time in clock cycles; each instruction passes through Mem, Reg, ALU, Mem, Reg]

Page 38

Control Hazard Solutions

• Predict: guess one direction, then back up if wrong
  • Predict not taken
• Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of time)
• More dynamic scheme: history of 1 branch (≈ 90%)

[Figure: pipeline diagram, instruction order (Add, Beq, Load) versus time in clock cycles; each instruction passes through Mem, Reg, ALU, Mem, Reg]
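The cost of prediction is an expected value over right and wrong guesses. A minimal sketch using the slide's 1-cycle and 2-cycle costs; the function name is illustrative, and the 90% accuracy is taken as the figure quoted for the 1-branch history scheme.

```python
def cycles_per_branch(p_correct, hit_cycles=1, miss_cycles=2):
    """Expected clock cycles per branch for a given prediction accuracy."""
    return p_correct * hit_cycles + (1 - p_correct) * miss_cycles

for accuracy in (0.5, 0.9):
    print(f"accuracy {accuracy:.0%}: {cycles_per_branch(accuracy):.2f} cycles per branch")
```

Compared with always stalling (2 cycles per branch), a predictor that is right half the time averages 1.5 cycles, and one that is right 90% of the time averages about 1.1.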

Page 39

Control Hazard Solutions

• Redefine branch behavior (takes place after next instruction): “delayed branch”
• Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “slot” (≈ 50% of time)
• As more instructions are launched per clock cycle, the delayed branch becomes less useful

[Figure: pipeline diagram, instruction order (Add, Beq, Misc, Load) versus time in clock cycles; each instruction passes through Mem, Reg, ALU, Mem, Reg]

Page 40

Data Hazard

Page 41

Data Hazard: Forwarding

Use temporary results, don’t wait for them to be written
• register file forwarding to handle read/write to same register
• ALU forwarding
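ALU forwarding can be sketched as a selection among three sources for each ALU operand. The conditions below follow the common textbook formulation for a five-stage MIPS pipeline and are an assumption here, as are the pipeline-register names (EX/MEM, MEM/WB) and the function name `forward_source`; the slide itself does not give them.

```python
def forward_source(src_reg, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Choose where an ALU operand should come from.

    src_reg is the register number the instruction in EX wants to read.
    Register 0 is never forwarded because it always reads as zero in MIPS.
    """
    # Most recent result: the instruction one ahead, now in the MEM stage.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Older result: the instruction two ahead, now in the WB stage.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    # No hazard: use the value read from the register file.
    return "REGFILE"

# add r1, r2, r3 followed immediately by sub r4, r2, r1: r1 comes from EX/MEM
print(forward_source(1, True, 1, False, 0))
```

Checking EX/MEM first matters when both earlier instructions write the same register, since the newer value must win.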

Page 42

Can’t always forward

Load word can still cause a hazard: an instruction tries to read a register immediately after a load instruction that writes to the same register.

Thus, we need a hazard detection unit to “stall” the instruction that follows the load
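The check performed by such a hazard detection unit can be sketched as a single condition. The signal names (the MemRead flag and rt field of the instruction in EX, the rs and rt fields of the instruction in ID) follow the usual textbook convention and are assumptions here; `must_stall` is an illustrative name.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in ID needs the result of a load that is still in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) followed by and $4, $2, $5: the and reads $2 one cycle too early
print(must_stall(True, 2, 2, 5))
```

When the condition holds, the pipeline holds the dependent instruction in its stage for one cycle and inserts a bubble behind the load; after that, forwarding can supply the loaded value.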

Page 43

Stalling

We can stall the pipeline by keeping an instruction in the same stage

Page 44

Designing Instruction Set for Pipelining

• MIPS instructions are all the same length.
  • This makes it much easier to fetch instructions in the first stage and to decode them in the second stage.
  • In IA-32, instructions vary from 1 to 17 bytes, so pipelining is considerably more challenging.
  • All recent IA-32 architectures translate instructions into micro-operations, which are pipelined.
• MIPS has only a few instruction formats, with the source register fields in the same place.
  • This symmetry means the second stage can read the register file at the same time that the hardware is decoding the type of instruction.
  • Without this symmetry we would need to split stage 2.

Page 45

Designing Instruction Set for Pipelining

• Memory operands appear only in loads or stores in MIPS.
  • We can use the execute stage to calculate the memory address and then access memory in the following stage.
  • If we could operate on operands in memory, stages 3 and 4 would expand to an address stage, a memory stage, and then an execute stage.
• Operands must be aligned in memory.
  • No instruction requires two data memory accesses, so the transfer can be done in a single pipeline stage.

Page 46

Summary: Pipelining

What makes it easy?
• all instructions are the same length
• just a few instruction formats
• memory operands appear only in loads and stores

What makes it hard?
• structural hazards: suppose we had only one memory
• control hazards: need to worry about branch instructions
• data hazards: an instruction depends on a previous instruction

We’ll talk about modern processors and what really makes it hard:
• exception handling
• trying to improve performance with out-of-order execution, etc.

Page 47

Summary

• Pipelining is a fundamental concept
  • multiple steps using distinct resources
• Utilize capabilities of the datapath by pipelined instruction processing
  • start next instruction while working on the current one
  • limited by length of longest stage (plus fill/flush)
  • detect and resolve hazards