CDA 5155
Computer Architecture
Week 1.5
Start with the materials: Conductors and Insulators
• Conductor: a material that permits electrical current to flow easily (low resistance to current flow). A lattice of atoms with free electrons.
• Insulator: a material that is a poor conductor of electrical current (high resistance to current flow). A lattice of atoms with strongly held electrons.
• Semiconductor: a material that can act like a conductor or an insulator depending on conditions (variable resistance to current flow).
Making a semiconductor using silicon
[Figure: a pure silicon lattice, each atom sharing its valence electrons with its neighbors]
What is a pure silicon lattice? A. Conductor B. Insulator C. Semiconductor
N-type Doping
We can increase the conductivity by adding atoms of phosphorus or arsenic to the silicon lattice. They have one more electron, which is free to wander…
This is called n-type doping since we add some free (negatively charged) electrons.
Making a semiconductor using silicon
[Figure: the silicon lattice with a phosphorus (P) atom substituted in; its extra electron is easily moved from its site]
What is an n-doped silicon lattice? A. Conductor B. Insulator C. Semiconductor
P-type Doping
Interestingly, we can also improve the conductivity by adding atoms of gallium or boron to the silicon lattice. They have one fewer electron, which creates a hole. Holes also conduct current by stealing electrons from their neighbors (thus moving the hole).
This is called p-type doping since we have fewer (negatively charged) electrons in the bonds holding the atoms together.
Making a semiconductor using silicon
[Figure: the silicon lattice with a gallium (Ga) atom substituted in, leaving a hole (?) in one bond]
This atom will accept an electron even though it is one too many, since it fills the eighth electron position in this shell. Again this lets current flow, since the electron must come from somewhere to fill this position.
Using doped silicon to make a junction diode
A junction diode allows current to flow in one direction and blocks it in the other.
[Figure: a p-n junction between GND and Vcc; electrons like to move toward Vcc, and electrons move from GND to fill holes]
[Figure: the same junction with Vcc and GND swapped; current flows as electrons cross the junction]
Making a transistor
Our first level of abstraction is the transistor (basically 2 diodes sitting back-to-back).
[Figure: transistor cross-section showing the gate over P-type material]
Transistors are electronic switches connecting the source to the drain if the gate is “on”.
http://www.intel.com/education/transworks/INDEX.HTM
[Figure: transistor switch states with Vcc]
Review of basic pipelining
• 5 stage “RISC” load-store architecture – about as simple as things get
• Instruction fetch: get instruction from memory/cache
• Instruction decode: translate opcode into control signals and read regs
• Execute: perform ALU operation
• Memory: access memory if load/store
• Writeback/retire: update register file
Pipelined implementation
• Break the execution of the instruction into cycles (5 in this case).
• Design a separate datapath stage for the execution performed during each cycle.
• Build pipeline registers to communicate between the stages.
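The three steps above can be sketched as a toy simulator (illustrative Python with hypothetical field names, not the actual LC3101 datapath): each pipeline register is read at the clock edge and then rewritten from the stage in front of it.

```python
# A minimal sketch of the five-stage idea: four pipeline registers
# (IF/ID, ID/EX, EX/Mem, Mem/WB) that all advance together each cycle.

NOP = {"op": "noop"}

def run_pipeline(program, cycles):
    pc = 0
    if_id = id_ex = ex_mem = mem_wb = NOP
    retired = []
    for _ in range(cycles):
        retired.append(mem_wb["op"])                 # stage 5: writeback
        # One simultaneous shift models edge-triggered latching.
        mem_wb, ex_mem, id_ex = ex_mem, id_ex, if_id  # stages 4, 3, 2
        if_id = program[pc] if pc < len(program) else NOP  # stage 1: fetch
        pc += 1
    return retired

prog = [{"op": "add"}, {"op": "nand"}, {"op": "lw"}]
print(run_pipeline(prog, 7))
# ['noop', 'noop', 'noop', 'noop', 'add', 'nand', 'lw']
```

Note the four-cycle latency between an instruction entering fetch and reaching writeback; that gap is exactly where the hazards discussed later come from.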
Stage 1: Fetch
Design a datapath that can fetch an instruction from memory every cycle. Use the PC to index memory to read the instruction.
Increment the PC (assume no branches for now).
Write everything needed to complete execution to the pipeline register (IF/ID). The next stage will read this pipeline register.
Note that the pipeline register must be edge triggered.
[Figure: fetch datapath – the PC (behind a MUX) indexes instruction memory; a +1 incrementer computes PC+1; the instruction bits and PC+1 are written to the IF/ID pipeline register for the rest of the pipelined datapath]
Stage 2: Decode
Design a datapath that reads the IF/ID pipeline register, decodes the instruction, and reads the register file (specified by regA and regB of the instruction bits). Decode is easy: just pass on the opcode and let later stages figure out their own control signals for the instruction.
Write everything needed to complete execution to the pipeline register (ID/EX). Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!). Include PC+1 even though decode didn’t use it.
[Figure: decode datapath – the Stage 1 fetch datapath feeds the IF/ID register; instruction bits select regA and regB in the register file; contents of regA, contents of regB, the instruction bits, and PC+1 are written to the ID/EX pipeline register]
Stage 3: Execute
Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register. The inputs are the contents of regA and either the contents of regB or the offset field of the instruction. Also calculate PC+1+offset in case this is a branch.
Write everything needed to complete execution to the pipeline register (EX/Mem): the ALU result, the contents of regB, and PC+1+offset; the instruction bits for the opcode and destReg specifiers; and the result from comparing the regA and regB contents.
[Figure: execute datapath – the Stage 2 decode datapath feeds the ID/EX register; a MUX selects the contents of regB or the offset as the second ALU input; an adder computes PC+1+offset; the ALU result, contents of regB, PC+1+offset, and instruction bits are written to the EX/Mem pipeline register]
Stage 4: Memory Operation
Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register. The ALU result contains the address for ld and st instructions.
Opcode bits control the memory R/W and enable signals.
Write everything needed to complete execution to the pipeline register (Mem/WB): the ALU result and MemData, plus the instruction bits for the opcode and destReg specifiers.
[Figure: memory datapath – the Stage 3 execute datapath feeds the EX/Mem register; the ALU result addresses data memory (en and R/W controlled by the opcode); PC+1+offset goes back to the MUX before the PC in stage 1 (MUX control for the PC input); the ALU result, memory read data, and instruction bits are written to the Mem/WB pipeline register]
Stage 5: Write back
Design a datapath that completes the execution of this instruction, writing to the register file if required. Write MemData to destReg for the ld instruction; write the ALU result to destReg for add or nand instructions.
Opcode bits also control the register write enable signal.
[Figure: writeback datapath – the Stage 4 memory datapath feeds the Mem/WB register; a MUX selects between the ALU result and the memory read data, and the result goes back to the data input of the register file; instruction bits 0-2 or 16-18, selected by a MUX, go back as the destination register specifier; the opcode drives the register write enable]
[Figure: the complete pipelined datapath – PC, instruction memory, register file, sign extend, ALU, data memory, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers; instruction bits 0-2 and 16-18 feed the destination-register MUX, and MUXes steer the ALU inputs and the register write data]
Sample Test Question (Easy)
• Which item does not need to be included in the Mem/WB pipeline register for the LC3101 pipelined implementation discussed in class?
A. ALU result
B. Memory read data
C. PC+1+offset
D. Destination register specifier
E. Instruction opcode
Answer: C. PC+1+offset
Sample Test Question (Hard?)
• What items need to be added to one of the pipeline registers (discussed in class) to support the
<insert nasty instruction description here> ?
A. IF/ID: PC
B. ID/EX: PC+offset
C. EX/Mem: Contents of regA
D. EX/Mem: ALU2 result
E. Mem/WB: Contents of regA
Things to think about…
1. How would you modify the pipeline datapath if you wanted to double the clock frequency?
2. Would it actually double?
3. How do you determine the frequency?
Sample Code (Simple)
Run the following code on the pipelined LC3101:
add 1 2 3   ; reg 3 = reg 1 + reg 2
nand 4 5 6  ; reg 6 = reg 4 & reg 5
lw 2 4 20   ; reg 4 = Mem[reg 2 + 20]
add 2 5 5   ; reg 5 = reg 2 + reg 5
sw 3 7 10   ; Mem[reg 3 + 10] = reg 7
[Figure: the pipelined datapath annotated with pipeline-register fields – IF/ID: instruction, PC+1; ID/EX: op, dest, offset, valB, valA, PC+1; EX/Mem: op, dest, valB, ALU result, target, eq?; Mem/WB: op, dest, ALU result, mdata – plus the register file R0-R7, with instruction bits 22-24 and 16-18 selecting regA and regB and bits 0-2 feeding the dest MUX]

[Figure sequence: cycle-by-cycle snapshots of this datapath running the sample code. Initial registers: R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22]
Time 0: initial state – the pipeline holds noops
Time 1: fetch add 1 2 3
Time 2: fetch nand 4 5 6; add in decode
Time 3: fetch lw 2 4 20; nand in decode, add in execute
Time 4: fetch add 2 5 5; lw in decode, nand in execute, add in memory
Time 5: fetch sw 3 7 10; add in decode, lw in execute, nand in memory, add writes back (R3 = 45)
Time 6: no more instructions; sw in decode, add in execute, lw in memory, nand writes back (R6 = -3)
Time 7: sw in execute, add in memory, lw writes back (R4 = 99)
Time 8: sw in memory (Mem[55] = 22), add writes back (R5 = 16)
Time 9: sw completes writeback (no register write)
Time graphs

Time:  1      2      3       4       5         6         7         8         9
add    fetch  decode execute memory  writeback
nand          fetch  decode  execute memory    writeback
lw                   fetch   decode  execute   memory    writeback
add                          fetch   decode    execute   memory    writeback
sw                                   fetch     decode    execute   memory    writeback
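The time graph above is mechanical: with one instruction issued per cycle and no hazards, instruction i occupies stage s during cycle i + s. A small Python sketch (illustrative, not course material) that prints such a chart:

```python
# Prints a pipeline occupancy chart: F D E M W mark the five stages.

STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def chart(instrs):
    width = len(instrs) + len(STAGES) - 1          # total cycles needed
    rows = []
    for i, name in enumerate(instrs):
        cells = ["."] * width
        for s in range(len(STAGES)):
            cells[i + s] = STAGES[s][0].upper()     # instruction i in stage s at cycle i+s
        rows.append(f"{name:<5} " + " ".join(cells))
    return rows

for row in chart(["add", "nand", "lw", "add", "sw"]):
    print(row)
```

For the five sample instructions this gives a 9-cycle chart, matching the Time 1 through Time 9 snapshots.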
What can go wrong?
• Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if it is about to be written.
• Control hazards: a branch instruction may change the PC, but not until stage 4. What do we fetch before that?
• Exceptions: how do you handle exceptions in a pipelined processor with 5 instructions in flight?
Data Hazards
Data hazards:
What are they?
How do you detect them?
How do you deal with them?
Pipeline function for ADD
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate sum
Memory: Pass results to next stage
Writeback: write sum into register file
Data Hazards
add 1 2 3
nand 3 4 5

time →
add    fetch  decode  execute  memory   writeback
nand          fetch   decode   execute  memory   writeback

If not careful, nand will read the wrong value of R3.
[Figure: the pipelined datapath with its pipeline-register fields (op, dest, offset, valA, valB, PC+1, target, eq?, ALU result, mdata), repeated to highlight where regA/regB values are read (decode) and where results live (EX/Mem, Mem/WB)]
[Figure: the same datapath with three forwarding (fwd) paths added, routing in-flight results back toward the ALU inputs]
Three approaches to handling data hazards
Avoid: make sure there are no hazards in the code.
Detect and Stall: if hazards exist, stall the processor until they go away.
Detect and Forward: if hazards exist, fix up the pipeline to get the correct value (if possible).
Handling data hazards I: Avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation. Make sure no hazards exist.
• Put noops between any dependent instructions.

add 1 2 3
noop
noop
nand 3 4 5

(add writes R3 in cycle 5; nand reads R3 in cycle 5)
Problems with this solution
Old programs (legacy code) may not run correctly on new implementations. Longer pipelines need more noops.
Programs get larger as noops are included. This is especially a problem for machines that try to execute more than one instruction every cycle.
Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI is 1, but some instructions are noops.
Handling data hazards II: Detect and stall until ready
Detect:
Compare regA with previous DestRegs (3-bit operand fields)
Compare regB with previous DestRegs (3-bit operand fields)
Stall:
Keep current instructions in fetch and decode
Pass a noop to execute
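The detect step is just a handful of 3-bit comparators. A sketch in Python (illustrative field names; assumes, as on a later slide, that the register file forwards internally when a write and a read land in the same cycle, so only the instructions in execute and memory matter):

```python
# Detect-and-stall comparators: the instruction sitting in decode must
# stall if either source register matches the destination of an
# instruction still in execute or memory.

def writes_reg(instr):
    # Assumption for this sketch: only add, nand and lw write a register.
    return instr is not None and instr["op"] in ("add", "nand", "lw")

def must_stall(decode, in_flight):
    """decode: {'op','regA','regB'}; in_flight: instrs in EX and Mem."""
    for older in in_flight:
        if writes_reg(older) and older["dest"] in (decode["regA"], decode["regB"]):
            return True
    return False

add_instr  = {"op": "add",  "dest": 3}               # add 1 2 3
nand_instr = {"op": "nand", "regA": 3, "regB": 4}    # nand 3 4 5
print(must_stall(nand_instr, [add_instr, None]))     # True: RAW hazard on R3
```

When `must_stall` is true, the stall itself is implemented by disabling the PC and IF/ID write enables and muxing a noop into ID/EX, exactly as the next figures show.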
Hazard detection
[Figure sequence: detect-and-stall on add 1 2 3 followed by nand 3 4 5]
First half of cycle 3: nand 3 4 5 is in decode while add (dest R3) is ahead of it; comparators match the nand’s regA=3 and regB=3 against the dest fields in the downstream pipeline registers, and a hazard is detected
First half of cycle 3 (stall): the PC and IF/ID write enables (en) are turned off, so fetch and decode keep their current instructions
End of cycle 3: a noop has been passed into execute; nand stays in decode
First half of cycle 4: the hazard is still detected (add has only reached memory), so the pipeline stalls again
End of cycle 4: another noop enters execute
First half of cycle 5: no hazard; add is in writeback, so nand can read R3
End of cycle 5: nand moves into execute with the correct contents of R3
No more stalling
add 1 2 3
nand 3 4 5

time →
add    fetch  decode  execute  memory  writeback
nand          fetch   decode   decode  decode    execute  memory  writeback
                      hazard   hazard

Assume the Register File gives the right value of R3 when read/written during the same cycle.
Problems with detect and stall
CPI increases every time a hazard is detected!
Is that necessary? Not always! Re-route the result of the add to the nand:
• nand no longer needs to read R3 from the reg file
• It can get the data later (when it is ready)
• This lets us complete the decode this cycle
– But we need more control to remember that the data we aren’t getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.
Handling data hazards III: Detect and forward
Detect: same as detect and stall, except that all 4 hazards are treated differently
• i.e., you can’t logical-OR the 4 hazard signals
Forward:
New bypass datapaths route computed data to where it is needed
A new MUX and control to pick the right data
• Beware: stalling may still be required even in the presence of forwarding
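The control for one of those new MUXes can be sketched as follows (illustrative Python with hypothetical field names, not the LC3101 control logic). The priority matters: take the value from the youngest producer still in flight, so EX/Mem beats Mem/WB, and the register file is only the fallback.

```python
# Forwarding MUX for one ALU input: pick the most recent value of src_reg.

def alu_input(src_reg, regfile, ex_mem, mem_wb):
    if ex_mem is not None and ex_mem["dest"] == src_reg:
        return ex_mem["alu_result"]          # bypass from EX/Mem
    if mem_wb is not None and mem_wb["dest"] == src_reg:
        return mem_wb["value"]               # bypass from Mem/WB
    return regfile[src_reg]                  # no hazard: normal register read

regs = [0, 36, 9, 12, 18, 7, 41, 22]         # R0..R7, as in the sample run
ex_mem = {"dest": 3, "alu_result": 45}       # add 1 2 3 just produced R3 = 45
print(alu_input(3, regs, ex_mem, None))      # 45, not the stale 12
print(alu_input(5, regs, ex_mem, None))      # 7, straight from the reg file
```

A load is the exception: its value only exists after the memory stage, which is why the slide warns that stalling may still be required.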
Sample Code
Which hazards do you see?
add 1 2 3
nand 3 4 5
add 4 3 7
add 6 3 7
lw 3 6 10
sw 6 2 12
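One way to check your answer mechanically: a toy RAW-hazard finder (illustrative encoding: add/nand are "op srcA srcB dest", lw is "op srcA dest offset", sw is "op srcA srcB offset"; assuming same-cycle register-file write/read works, only the two preceding instructions can conflict).

```python
# Lists (producer index, consumer index, register) RAW-hazard triples.

def reads_of(op, a, b):
    if op in ("add", "nand", "sw"):
        return {a, b}
    if op == "lw":
        return {a}
    return set()

def dest_of(op, a, b, c):
    if op in ("add", "nand"):
        return c
    if op == "lw":
        return b
    return None            # sw writes memory, not a register

def raw_hazards(prog):
    found = []
    for i, (op, a, b, c) in enumerate(prog):
        for j in range(max(0, i - 2), i):   # only 2 older instrs can conflict
            dest = dest_of(*prog[j])
            if dest is not None and dest in reads_of(op, a, b):
                found.append((j, i, dest))
    return found

prog = [("add", 1, 2, 3), ("nand", 3, 4, 5), ("add", 4, 3, 7),
        ("add", 6, 3, 7), ("lw", 3, 6, 10), ("sw", 6, 2, 12)]
print(raw_hazards(prog))   # [(0, 1, 3), (0, 2, 3), (4, 5, 6)]
```

Note the add 6 3 7 / add 1 2 3 pair is three instructions apart, so it is covered by the same-cycle register-file behavior; and the last hazard is a load feeding a store (R6), the case forwarding alone cannot fully hide.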
[Figure sequence: cycle-by-cycle snapshots (first halves and ends of cycles 3 through 8) of detect-and-forward running the sample code. Forwarding MUXes and their controls (H1, H2, H3) steer the in-flight ALU result or memory read data to the ALU inputs, so dependent instructions like nand 3 4 5 and add 4 3 7 keep moving through decode instead of stalling. The lw 3 6 10 / sw 6 2 12 dependence on R6 still forces a stall (a noop is injected), because the loaded value is only available after lw’s memory stage.]
Control hazards
How can the pipeline handle branch and jump instructions?
Pipeline function for BEQ
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate target address and test for equality
Memory: Send target to PC if test is equal
Writeback: Nothing left to do
Control Hazards
beq 1 1 10
sub 3 4 5

time →
beq    fetch  decode  execute  memory  writeback
sub           fetch   decode   execute
Approaches to handling control hazards
Avoid: make sure there are no hazards in the code.
Detect and Stall: delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong: go ahead and fetch more instructions in case the prediction is correct, but stop them if they shouldn’t have been executed.
Handling control hazards I: Avoid all hazards
Don’t have branch instructions! (Maybe a little impractical.)
Delay taking the branch: dbeq r1 r2 offset
Instructions at PC+1, PC+2, etc. will execute before deciding whether to fetch from PC+1+offset. (If no useful instructions can be placed after dbeq, noops must be inserted.)
Problems with this solution
Old programs (legacy code) may not run correctly on new implementations. Longer pipelines need more instructions/noops after a delayed beq.
Programs get larger as noops are included. This is especially a problem for machines that try to execute more than one instruction every cycle. Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI equals 1, but some instructions are noops.
Handling control hazards II: Detect and stall
Detection:
Must wait until decode
Compare opcode to beq or jalr
Alternately, this is just another control signal
Stall:
Keep current instructions in fetch
Pass a noop to the decode stage (not execute!)
[Figure: the pipelined datapath with control logic – when decode recognizes a branch, a MUX feeds a noop into the ID/EX register while the PC and IF/ID hold their values]
Control Hazards
beq 1 1 10
sub 3 4 5

time →
beq      fetch  decode  execute  memory  writeback
sub             fetch   fetch    fetch   fetch
or
Target:                                  fetch
Problems with detect and stall
CPI increases every time a branch is detected!
Is that necessary? Not always! Only about ½ of the time is the branch taken.
• Let’s assume that it is NOT taken…
– In this case, we can ignore the beq (treat it like a noop)
– Keep fetching PC + 1
• What if we are wrong?
– OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don’t perform writeback)
Handling control hazards III: Speculate and squash
Speculate: assume not equal. Keep fetching from PC+1 until we know that the branch is really taken.
Squash: stop bad instructions if taken. Send a noop to decode, execute and memory, and send the target address to the PC.
[Figure: squashing – a taken beq reaches memory while the add, sub, and nand fetched behind it are still in flight; noops replace them in decode, execute, and memory, and the target address is sent through the MUX to the PC]
Problems with fetching PC+1
CPI increases every time a branch is taken! (About ½ of the time.)
Is that necessary?
No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken???
[Figure: the datapath with a branch predictor (bpc) beside the PC – the predicted target is fetched immediately, and the target and eq? results computed later confirm or correct the prediction]
Branch prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~97% accurate
Role of the Compiler
• The primary user of the instruction set
– Exceptions: getting less common
• Some device drivers; specialized library routines
• Some small embedded systems (synthesized arch)
• Compilers must: generate a correct translation into machine code
• Compilers should: compile fast; generate fast code
• While we are at it: generate reasonable code size; good debug support
Structure of Compilers
• Front-end: translates high-level semantics to some generic intermediate form
– The intermediate form does not have any resource constraints, but uses simple instructions.
• Back-end: translates the intermediate form into assembly/machine code for the target architecture
– Resource allocation; code optimization under resource constraints
Architects are mostly concerned with optimization.
Typical optimizations: CSE
• Common sub-expression elimination
c = array1[d+e] / array2[d+e];
becomes
i = d+e;
c = array1[i] / array2[i];
• Purpose: reduce instructions / faster code
• Architectural issues: more register pressure
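The transformation itself can be sketched as a tiny pass over three-address code (purely illustrative Python; a real pass must also invalidate an expression when one of its operands is redefined):

```python
# Toy common-sub-expression elimination over (dest, op, src1, src2) tuples.

def cse(code):
    seen, out = {}, []
    for dest, op, a, b in code:
        key = (op, a, b)
        if key in seen:
            out.append((dest, "copy", seen[key], None))  # reuse earlier result
        else:
            seen[key] = dest
            out.append((dest, op, a, b))
    return out

before = [("t1", "add", "d", "e"),    # d+e for the array1 index
          ("t2", "add", "d", "e")]    # d+e recomputed for array2
print(cse(before))                    # second add becomes a copy of t1
```

The “copy” that replaces the second add is exactly where the register pressure comes from: t1 must stay live until its last reuse.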
Typical optimization: LICM
• Loop invariant code motion
for (i=0; i<100; i++) { t = 5; array1[i] = t; }
The assignment t = 5 can be hoisted above the loop.
• Purpose: remove statements or expressions from loops that need only be executed once (idempotent)
• Architectural issues: more register pressure
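The hoisting decision reduces to a dependence test. A toy sketch in Python (illustrative only; a real pass must also check for side effects and that the hoisted definition still dominates all of its uses):

```python
# Toy loop-invariant code motion: a statement whose operands do not
# depend on anything that changes inside the loop can run once, outside.

def licm(loop_body, loop_vars):
    hoisted, kept = [], []
    for stmt, used_vars in loop_body:
        if set(used_vars) & loop_vars:
            kept.append(stmt)         # depends on the loop: stays inside
        else:
            hoisted.append(stmt)      # execute once, before the loop
    return hoisted, kept

body = [("t = 5", []), ("array1[i] = t", ["i", "t"])]
print(licm(body, {"i"}))   # (['t = 5'], ['array1[i] = t'])
```

As with CSE, the hoisted value (t here) now lives in a register across the whole loop, which is the register-pressure cost the slide mentions.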
Other transformations
• Procedure inlining: better instruction schedule
– greater code size, more register pressure
• Loop unrolling: better loop schedule
– greater code size, more register pressure
• Software pipelining: better loop schedule
– greater code size, more register pressure
• In general, “global” optimization: faster code
– greater code size, more register pressure
Compiled code characteristics
• Optimized code has different characteristics than unoptimized code.
– Fewer memory references, but it is generally the “easy ones” that are eliminated
• Example: better register allocation retains active data in the register file; these would be cache hits in unoptimized code.
– Removing redundant memory and ALU operations leaves a higher ratio of branches in the code
• Branch prediction becomes more important
Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure.
What do compiler writers want in an instruction set architecture?
• More resources: better optimization tradeoffs
• Regularity: same behaviour in all contexts
– no special cases (flags set differently for immediates)
• Orthogonality:
– data type independent of addressing mode
– addressing mode independent of operation performed
• Primitives, not solutions:
– keep instructions simple
– it is easier to compose than to fit (e.g. MMX operations)
What do architects want in an instruction set architecture?
• Simple instruction decode:
– tends to increase orthogonality
• Small structures:
– more resource constraints
• Small data bus fanout:
– tends to reduce orthogonality and regularity
• Small instructions:
– make things implicit
– non-regular, non-orthogonal, non-primitive
To make faster processors
• Make the compiler team unhappy– More aggressive optimization over the entire program
– More resource constraints; caches; HW schedulers
– Higher expectations: increase IPC
• Make the hardware design team unhappy
– Tighter design constraints (clock)
– Execute optimized code with more complex execution characteristics
– Make all stages bottlenecks (Amdahl’s law)