University of Amsterdam Computer Systems – the processor architecture Arnoud Visser 1 Computer...


Computer Systems

The processor architecture


Basic Knowledge

• Relative timing of the elements is important


Programmer-visible state

Von Neumann architecture, both instructions and data in memory

• Program registers: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, %ebp

• Condition codes (CC)

• Program counter (PC)

• Memory


Program counter

• The program counter holds the address of the instruction currently executed

• The next instruction has to be collected from memory (slow!)

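[Figure: virtual address space of a process, from the highest address to the lowest]

0xffffffff

– Kernel virtual memory (memory invisible to user code)

0xc0000000

– User stack (created at runtime)

0x40000000

– Memory mapped region for shared libraries (contains the printf() function)

– Run-time heap (created at runtime by malloc)

– Read/write data (loaded from the hello executable file)

– Read-only code and data (loaded from the hello executable file)

0x08048000

– Unused

0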


Processing a single instruction

• Fetch: read the instruction (1-6 bytes) from memory

• Decode: read the operand values from the registers

• Execute: perform an arithmetic/logic operation OR test the jump conditions

• Memory: read from or write to memory

• Write back: update the registers

• PC update: set the address of the next instruction


Seq. architecture

• Hardware connected with named wires (word & bytes, byte & bits, bit)

[Figure: sequential architecture. The stages Fetch, Decode, Execute, Memory and Write back are built from the PC, instruction memory, PC increment, register file (ports A, B, M, E), ALU, CC and data memory.]

[Figure: fetch logic. The instruction memory is addressed by the PC. Split turns byte 0 into icode and ifun; Align turns bytes 1-5 into rA, rB and valC. The control signals Instr valid, Need regids and Need valC are derived from icode, and PC increment computes valP.]


Stage Computation: ALU Operation

– Formulate instruction execution as sequence of simple steps

– Use same general form for all instructions

OPl rA, rB

• Fetch
– icode:ifun ← M1[PC] (read instruction byte)
– rA:rB ← M1[PC+1] (read register byte)
– valP ← PC+2 (compute next PC)

• Decode
– valA ← R[rA] (read operand A)
– valB ← R[rB] (read operand B)

• Execute
– valE ← valB OP valA, with OP selected by ifun (perform ALU operation)
– Set CC (set condition code register)

• Memory: no operation

• Write back
– R[rB] ← valE (write back result)

• PC update
– PC ← valP (update PC)
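The stage computation for OPl can be traced in a few lines of Python. This is only a sketch under assumptions that are not on the slide: the register file is a plain list indexed by register id, memory is a bytes object, ifun values 0 to 3 are taken to mean addl, subl, andl and xorl, and only the ZF and SF flags are modelled. The name step_opl is hypothetical.

```python
import operator

# Assumed ifun encoding (not stated on the slide): 0=addl, 1=subl, 2=andl, 3=xorl.
ALU_OPS = {0: operator.add, 1: operator.sub, 2: operator.and_, 3: operator.xor}


def step_opl(pc, regs, mem):
    """Run one 'OPl rA, rB' instruction through all stages; return (new PC, CC)."""
    # Fetch: instruction byte, register byte, address of the next instruction.
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valP = pc + 2
    # Decode: read both operands from the register file.
    valA, valB = regs[rA], regs[rB]
    # Execute: valE <- valB OP valA, truncated to 32 bits, and set CC.
    valE = ALU_OPS[ifun](valB, valA) & 0xFFFFFFFF
    cc = {"ZF": valE == 0, "SF": bool(valE >> 31)}  # OF left out of this sketch
    # Memory: nothing to do for OPl.
    # Write back: the result goes to rB.
    regs[rB] = valE
    # PC update: continue with the next instruction.
    return valP, cc
```

For example, with regs = [0, 5, 7, 0, 0, 0, 0, 0] and the two instruction bytes 0x60 0x12, the call step_opl(0, regs, bytes([0x60, 0x12])) adds register 1 to register 2 and returns the next PC.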


Stage Computation: procedure call

– Use ALU to decrement stack pointer
– Store incremented PC

call Dest

• Fetch
– icode:ifun ← M1[PC] (read instruction byte)
– valC ← M4[PC+1] (read destination address)
– valP ← PC+5 (compute return point)

• Decode
– valB ← R[%esp] (read stack pointer)

• Execute
– valE ← valB + –4 (decrement stack pointer)

• Memory
– M4[valE] ← valP (write return address on stack)

• Write back
– R[%esp] ← valE (update stack pointer)

• PC update
– PC ← valC (set PC to destination)
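The same stages can be sketched for call. The assumptions here are not from the slide: memory is a bytearray holding little-endian 4-byte words, the register file is a list, and %esp has register id 4. The name step_call is hypothetical.

```python
ESP = 4  # assumed register id of %esp


def step_call(pc, regs, mem):
    """Run one 'call Dest' instruction through all stages; return the new PC."""
    # Fetch: instruction byte, 4-byte destination, return point.
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF  # read but not needed further here
    valC = int.from_bytes(mem[pc + 1:pc + 5], "little")
    valP = pc + 5
    # Decode: read the stack pointer.
    valB = regs[ESP]
    # Execute: the ALU decrements the stack pointer.
    valE = valB + -4
    # Memory: store the return address at the new top of the stack.
    mem[valE:valE + 4] = valP.to_bytes(4, "little")
    # Write back: update the stack pointer.
    regs[ESP] = valE
    # PC update: jump to the destination.
    return valC
```

After the call, the word at the new stack pointer holds the address of the instruction that follows the call, which is what a later ret needs to find.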


Stage Computation: jump

– Compute both addresses
– Choose based on setting of condition codes and branch condition XX/ifun

jXX Dest

• Fetch
– icode:ifun ← M1[PC] (read instruction byte)
– valC ← M4[PC+1] (read destination address)
– valP ← PC+5 (fall-through address)

• Decode: no operation

• Execute
– Bch ← Cond(CC, ifun) (take branch?)

• Memory: no operation

• Write back: no operation

• PC update
– PC ← Bch ? valC : valP (update PC)


Branch conditions

jXX    icode  ifun  Condition codes    Description
jmp    7      0     1                  Direct jump
jle    7      1     (SF^OF) | ZF       Less or equal (<=)
jl     7      2     SF^OF              Less (<)
je     7      3     ZF                 Equal (==)
jne    7      4     ~ZF                Not equal (!=)
jge    7      5     ~(SF^OF)           Greater or equal (>=)
jg     7      6     ~(SF^OF) & ~ZF     Greater (>)
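The branch test Cond(CC, ifun) and the PC selection of jXX can be written out as a small Python sketch. It assumes the flags are passed as booleans and uses the signed-comparison conventions in which jge is taken when SF^OF is 0 and jg additionally requires ZF to be 0. The function names cond and next_pc are hypothetical.

```python
def cond(zf, sf, of, ifun):
    """Return True if the jump selected by ifun (0..6) is taken."""
    less = sf != of  # SF ^ OF: the signed result was negative
    table = {
        0: True,                   # jmp: always taken
        1: less or zf,             # jle
        2: less,                   # jl
        3: zf,                     # je
        4: not zf,                 # jne
        5: not less,               # jge
        6: not less and not zf,    # jg
    }
    return table[ifun]


def next_pc(zf, sf, of, ifun, valC, valP):
    """PC update of jXX: destination if the branch is taken, else fall through."""
    return valC if cond(zf, sf, of, ifun) else valP
```

Both candidate addresses (valC and valP) are already available after the fetch stage; the flags only decide which one is selected.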


Datapaths & Control Logic

– ALU fun: select function
– ALU A: select input A
– ALU B: select input B
– Set CC: should the condition code register be loaded?

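[Figure: execute logic. The control blocks ALU A, ALU B, ALU fun. and Set CC take icode, ifun, valC, valA and valB as inputs. The ALU produces valE and can load CC; the bcond block produces Bch.]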


Control logic: ALU A

Instruction         Execute stage          Description
OPl rA, rB          valE ← valB OP valA    Perform ALU operation
rmmovl rA, D(rB)    valE ← valB + valC     Compute effective address
popl rA             valE ← valB + 4        Increment stack pointer
jXX Dest            (none)                 No operation
call Dest           valE ← valB + –4       Decrement stack pointer
ret                 valE ← valB + 4        Increment stack pointer

int aluA = [
    icode in { IRRMOVL, IOPL } : valA;
    icode in { IIRMOVL, IRMMOVL, IMRMOVL } : valC;
    icode in { ICALL, IPUSHL } : -4;
    icode in { IRET, IPOPL } : 4;
    # Other instructions don't need ALU
];
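The HCL case expression for aluA reads as a priority list: the first matching case wins. A rough Python equivalent, assuming instruction codes are represented by their HCL names as strings (an illustration choice, not the hardware encoding) and using the hypothetical name alu_a, could look like this:

```python
def alu_a(icode, valA, valC):
    """Select the A input of the ALU, mirroring the HCL cases in order."""
    if icode in {"IRRMOVL", "IOPL"}:
        return valA
    if icode in {"IIRMOVL", "IRMMOVL", "IMRMOVL"}:
        return valC
    if icode in {"ICALL", "IPUSHL"}:
        return -4
    if icode in {"IRET", "IPOPL"}:
        return 4
    return None  # other instructions don't need the ALU
```

In hardware this becomes a multiplexer whose select lines are derived from icode; the Python if-chain only mimics that selection.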


Hardware structure

• This can be translated into silicon

[Figure: sequential hardware structure. The stages Fetch, Decode, Execute, Memory and Write back are built from the PC, instruction memory, PC increment, register file (ports A, B, M, E), ALU, CC, data memory and new-PC logic. Signals shown: icode, ifun, rA, rB, valC, valP; srcA, srcB, dstE, dstM; valA, valB; ALU A, ALU B, ALU fun., Bch, valE; Mem. control (read/write), Addr, Data, data out, valM; newPC.]


Sequential is too slow

• The clock has to be slow enough to let the signals propagate through all wires and transistors

• Critical path: the slowest path between any two storage devices



Pipelining

• Divide each operation into stages, and allow the next operation to start as soon as the previous operation has finished the first stage

• This increases the throughput, but also increases the latency

[Figure: three-stage pipeline. The combinational logic blocks A, B and C (100 ps each) are each followed by a register (20 ps); all registers are driven by the same clock.]
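The effect on throughput and latency can be checked with a little arithmetic. The sketch below assumes that the clock period equals the slowest stage plus one register delay and that one instruction completes per cycle; the unpipelined comparison (all 300 ps of logic followed by a single 20 ps register) is an added assumption, and pipeline_timing is a hypothetical helper name.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps):
    """Return (cycle time in ps, latency in ps, throughput in GIPS)."""
    cycle = max(stage_delays_ps) + reg_delay_ps  # slowest stage sets the clock
    latency = cycle * len(stage_delays_ps)       # one instruction, all stages
    throughput = 1000.0 / cycle                  # 1 instr per cycle; ps -> GIPS
    return cycle, latency, throughput


# Three stages of 100 ps, each followed by a 20 ps register.
print(pipeline_timing([100, 100, 100], 20))
# Assumed unpipelined version: 300 ps of logic and one register.
print(pipeline_timing([300], 20))
```

The pipelined version runs at a 120 ps cycle with a 360 ps latency, against a 320 ps cycle and 320 ps latency without pipelining: throughput improves by a factor of about 2.67, while each individual instruction takes 40 ps longer.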


Insert registers between stages

• Pipeline registers mean extra silicon and extra delay

[Figure: pipelined hardware. The pipeline registers F, D, E, M and W sit in front of the stages Fetch, Decode, Execute, Memory and Write back. Signals shown: predPC, f_PC, valP; icode, ifun, rA, rB, valC; d_srcA, d_srcB; valA, valB; aluA, aluB; Bch, valE; M_icode, M_Bch, M_valA; Addr, Data; valM; W_icode, W_valM; W_valE, W_valM, W_dstE, W_dstM.]

[Figure: pipeline diagram over cycles 1 to 9. Instructions I1 to I5 each pass through F, D, E, M and W, each starting one cycle after the previous one. In cycle 5, I1 is in W, I2 in M, I3 in E, I4 in D and I5 in F.]
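The pipeline diagram follows a simple rule when no stalls or bubbles occur: instruction i enters Fetch in cycle i and moves one stage further every cycle. A minimal Python sketch of that rule, with the hypothetical helper name stage_in_cycle and 1-based instruction and cycle numbers as assumptions:

```python
STAGES = ["F", "D", "E", "M", "W"]


def stage_in_cycle(instr, cycle):
    """Stage occupied by instruction `instr` in `cycle`, or None if not in flight."""
    index = cycle - instr  # instruction i enters F in cycle i
    return STAGES[index] if 0 <= index < len(STAGES) else None


# Print the diagram for five instructions over nine cycles.
for instr in range(1, 6):
    row = [stage_in_cycle(instr, cycle) or "." for cycle in range(1, 10)]
    print(f"I{instr}: " + " ".join(row))
```

In cycle 5 all five stages are occupied, which is the steady state in which the pipeline completes one instruction per cycle.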


Data hazards

Additional pipeline control is needed to prevent unintended interactions between instructions

• Stalling (wait a few cycles until the hazard is gone)

• Data forwarding (pass a value to the E stage before it has gone through M/W)

Pipeline architecture was already used for the i386: http://www.pcmech.com/show/processors/35/


Pipeline efficiency

Pipeline control can prevent many, but not all interactions between instructions → bubbles

For the model described in the book:

• Load / Use hazards (20% of load instr. → 1 bubble)

• Mispredicted branches (40% of jmp instr. → 2 bubbles)

• Return from procedure calls (100% of ret instr. → 3 bubbles)
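The cost of these bubbles can be expressed as cycles per instruction (CPI): one cycle per instruction plus the average number of bubbles each instruction causes. The sketch below needs an instruction mix, which the slide does not give; the values used (25% loads, 20% conditional jumps, 2% returns) are illustrative assumptions, and the function name cpi is hypothetical.

```python
def cpi(penalties):
    """CPI = 1 + sum over hazard types of (instr. share * hazard rate * bubbles)."""
    return 1.0 + sum(share * rate * bubbles for share, rate, bubbles in penalties)


# (share of all instructions, fraction that causes the hazard, bubbles per hazard)
ASSUMED_MIX = [
    (0.25, 0.20, 1),  # loads: 20% are followed by a use of the loaded value
    (0.20, 0.40, 2),  # conditional jumps: 40% are mispredicted
    (0.02, 1.00, 3),  # returns: always 3 bubbles
]

print(f"CPI = {cpi(ASSUMED_MIX):.2f}")
```

Under these assumed frequencies, mispredicted branches contribute the largest share of the penalty, so better branch prediction would be the first place to look for improvement.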


Today’s architectures

• Superscalar (Pentium): often two instructions/cycle

• Dynamic execution (P6): three instructions out-of-order/cycle

• Explicit parallelism (Itanium): six execution units


Metrics of performance

[Figure: levels of a computer system, each with its own performance metrics]

Levels: Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins

Metrics:

• Answers per month

• Scaling of algorithms

• (Millions) of instructions per second: MIPS

• (Millions) of (F.P.) operations per second: MFLOP/s

• Megabytes per second

• Cycles per second (clock rate)

Each metric has a place and a purpose, and each can be optimized


Summary

• Shown that an instruction set architecture can be translated onto multiple processor architectures
– Complicated control logic on datapaths
– Compilers have to optimize the control logic for multiple machines/targets
– A programmer can aid/frustrate the compiler


Assignment

• Practice Problem 4.21 (page 314)

Calculate the throughput and latency of an n-stage pipeline for the given 6 blocks

[Figure: six blocks in sequence, with their delays: A 80 ps, B 30 ps, C 60 ps, D 50 ps, E 70 ps, F 10 ps, followed by a register (Reg) of 20 ps.]
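One way to explore the assignment is to try every way of cutting the chain of blocks into n consecutive stages and keep the split with the shortest clock period. The sketch below is a brute-force aid, not the book's solution method. It assumes that blocks stay in the order A to F, that every stage ends in a 20 ps register, and that one instruction completes per cycle; best_partition is a hypothetical name.

```python
from itertools import combinations

BLOCKS_PS = [80, 30, 60, 50, 70, 10]  # A, B, C, D, E, F
REG_PS = 20


def best_partition(delays_ps, n_stages, reg_ps):
    """Return (stage delays, cycle in ps, latency in ps, throughput in GIPS)."""
    best = None
    # Choose n_stages - 1 cut positions between consecutive blocks.
    for cuts in combinations(range(1, len(delays_ps)), n_stages - 1):
        bounds = (0, *cuts, len(delays_ps))
        stages = [sum(delays_ps[a:b]) for a, b in zip(bounds, bounds[1:])]
        cycle = max(stages) + reg_ps  # slowest stage plus register delay
        if best is None or cycle < best[0]:
            best = (cycle, stages)
    cycle, stages = best
    return stages, cycle, cycle * n_stages, 1000.0 / cycle


for n in range(1, len(BLOCKS_PS) + 1):
    print(n, best_partition(BLOCKS_PS, n, REG_PS))
```

Comparing the rows shows where adding stages stops paying off: once the slowest single block determines the clock period, further cuts only add register delay to the latency.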