University of Amsterdam Computer Systems – the processor architecture Arnoud Visser 1 Computer...

University of Amsterdam

Computer Systems – the processor architecture Arnoud Visser 1

Computer Systems

The processor architecture



Basic Knowledge

• Relative timing of the elements is important



Programmer-visible state

Von Neumann architecture, both instructions and data in memory

Program registers: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, %ebp

Other state: PC (program counter), CC (condition codes), Memory



Program counter

• The program counter holds the address of the instruction currently executed

• The next instruction has to be fetched from memory (slow!)

Virtual address space of a process (highest addresses first):

0xffffffff

• Kernel virtual memory (memory invisible to user code)

0xc0000000

• User stack (created at runtime)

• Memory mapped region for shared libraries, e.g. the printf() function (starts at 0x40000000)

• Run-time heap (created at runtime by malloc)

• Read/write data

• Read-only code and data (loaded from the hello executable file; starts at 0x08048000)

• Unused (from address 0)



Processing a single instruction

• Fetch – Read the instruction (1-6 bytes) from memory

• Decode – Read the operand values from the registers

• Execute – Perform an arithmetic/logic operation OR test the jump conditions

• Memory – Read from or write to memory

• Write back – Update the registers

• PC update – Set the address of the next instruction



Seq. architecture

• Hardware connected with named wires (word & bytes, byte & bits, bit)

[Figure: SEQ hardware, organised in the stages Fetch, Decode, Execute, Memory and Write back. Blocks: PC, instruction memory, PC increment, register file (ports A, B, M, E), ALU, CC, data memory.]

[Figure: fetch logic. The instruction memory is read at PC; Split turns byte 0 into icode and ifun; Align turns bytes 1-5 into rA, rB and valC; PC increment computes valP. Control signals: Instr valid, Need regids, Need valC.]



Stage Computation: ALU Operation

– Formulate instruction execution as sequence of simple steps

– Use same general form for all instructions

OPl rA, rB

Stage        Computation              Description
Fetch        icode:ifun ← M1[PC]      Read instruction byte
             rA:rB ← M1[PC+1]         Read register byte
             valP ← PC+2              Compute next PC
Decode       valA ← R[rA]             Read operand A
             valB ← R[rB]             Read operand B
Execute      valE ← valB OP valA      Perform ALU operation (OP selected by ifun)
             Set CC                   Set condition code register
Memory       (none)
Write back   R[rB] ← valE             Write back result
PC update    PC ← valP                Update PC
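The stage computation for OPl can be traced in a few lines of Python. This is only a sketch: the function name `step_opl`, the list/bytes representation of registers and memory, the register numbers, and the encodings (icode 6, ifun 0-3 for addl, subl, andl, xorl) are assumptions taken from the usual Y86 conventions, not from the slides.

```python
MASK = 0xFFFFFFFF  # 32-bit words

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x

# ifun -> operation, computed as valB OP valA (assumed Y86 function codes)
ALU_OPS = {
    0: lambda b, a: b + a,  # addl
    1: lambda b, a: b - a,  # subl
    2: lambda b, a: b & a,  # andl
    3: lambda b, a: b ^ a,  # xorl
}

def step_opl(pc, regs, mem):
    """Execute one OPl instruction; returns (new PC, condition codes)."""
    # Fetch: instruction byte, register byte, next PC
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    assert icode == 6, "not an OPl instruction (assumed icode 6)"
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valP = pc + 2
    # Decode: read both operands
    valA, valB = regs[rA], regs[rB]
    # Execute: ALU operation and condition codes
    exact = ALU_OPS[ifun](to_signed(valB), to_signed(valA))
    valE = exact & MASK
    cc = {
        "ZF": valE == 0,
        "SF": bool(valE & 0x80000000),
        "OF": not (-2**31 <= exact < 2**31),
    }
    # Memory: nothing to do for OPl
    # Write back: result goes to rB
    regs[rB] = valE
    # PC update
    return valP, cc

# addl with rA = register 1 and rB = register 2
regs = [0] * 8
regs[1], regs[2] = 3, 4
new_pc, cc = step_opl(0, regs, bytes([0x60, 0x12]))
print(new_pc, regs[2])  # prints: 2 7
```

Every stage writes only to local names until write back, which mirrors the idea that the register file and PC are updated once per instruction.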



Stage Computation: procedure call

– Use ALU to decrement stack pointer

– Store incremented PC

call Dest

Stage        Computation              Description
Fetch        icode:ifun ← M1[PC]      Read instruction byte
             valC ← M4[PC+1]          Read destination address
             valP ← PC+5              Compute return point
Decode       valB ← R[%esp]           Read stack pointer
Execute      valE ← valB + –4         Decrement stack pointer
Memory       M4[valE] ← valP          Write return address on stack
Write back   R[%esp] ← valE           Update stack pointer
PC update    PC ← valC                Set PC to destination
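The same stages for call, again as a hedged Python sketch. The name `step_call`, the `bytearray` memory, the little-endian word layout, the icode value 8 and the register number used for %esp are assumptions based on the usual Y86 conventions.

```python
import struct

ESP = 4  # assumed Y86 register number for %esp

def step_call(pc, regs, mem):
    """Execute one call instruction; mem is a bytearray. Returns the new PC."""
    # Fetch: instruction byte, 4-byte destination, return point
    icode = mem[pc] >> 4
    assert icode == 8, "not a call instruction (assumed icode 8)"
    (valC,) = struct.unpack_from("<I", mem, pc + 1)
    valP = pc + 5
    # Decode: read stack pointer
    valB = regs[ESP]
    # Execute: decrement stack pointer
    valE = (valB - 4) & 0xFFFFFFFF
    # Memory: push the return address
    struct.pack_into("<I", mem, valE, valP)
    # Write back: update stack pointer
    regs[ESP] = valE
    # PC update: continue at the destination
    return valC

mem = bytearray(64)
mem[0] = 0x80                         # call
struct.pack_into("<I", mem, 1, 0x20)  # Dest = 0x20
regs = [0] * 8
regs[ESP] = 64                        # stack starts at the top of this tiny memory
print(hex(step_call(0, regs, mem)), regs[ESP])  # prints: 0x20 60
```

Note that the value stored on the stack is valP, the address of the instruction after the call, while the PC itself receives valC.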



Stage Computation: jump

– Compute both addresses

– Choose based on setting of condition codes and branch condition XX/ifun

jXX Dest

Stage        Computation               Description
Fetch        icode:ifun ← M1[PC]       Read instruction byte
             valC ← M4[PC+1]           Read destination address
             valP ← PC+5               Fall through address
Decode       (none)
Execute      Bch ← Cond(CC, ifun)      Take branch?
Memory       (none)
Write back   (none)
PC update    PC ← Bch ? valC : valP    Update PC



Branch conditions

jXX   icode  ifun  Condition codes     Description
jmp   7      0     1                   Direct jump
jle   7      1     (SF^OF) | ZF        Less or equal <=
jl    7      2     SF^OF               Less <
je    7      3     ZF                  Equal ==
jne   7      4     ~ZF                 Not equal !=
jge   7      5     ~(SF^OF)            Greater or equal >=
jg    7      6     ~(SF^OF) & ~ZF      Greater >
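The branch decision Bch = Cond(CC, ifun) and the resulting PC update can be sketched in Python as follows. The function names `cond` and `next_pc` and the dictionary representation of the condition codes are illustrative choices, not part of the slides.

```python
def cond(cc, ifun):
    """Return True if the jump with function code ifun is taken."""
    zf, sf, of = cc["ZF"], cc["SF"], cc["OF"]
    less = sf != of  # SF ^ OF: signed "less than" after a compare/subtract
    table = {
        0: True,                   # jmp
        1: less or zf,             # jle
        2: less,                   # jl
        3: zf,                     # je
        4: not zf,                 # jne
        5: not less,               # jge
        6: not less and not zf,    # jg
    }
    return table[ifun]

def next_pc(cc, ifun, valC, valP):
    """PC update of jXX: destination if taken, fall-through otherwise."""
    return valC if cond(cc, ifun) else valP

# Condition codes after computing 3 - 5: negative, non-zero, no overflow
cc = {"ZF": False, "SF": True, "OF": False}
print(next_pc(cc, 2, 0x100, 0x15))  # jl is taken, prints: 256
```

Greater-or-equal is simply the negation of less, while greater additionally excludes the equal case via ~ZF.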



Datapaths & Control Logic

– ALU fun: select function

– ALU A: select Input A

– ALU B: select Input B

– Set CC: Should condition code register be loaded?

[Figure: Execute logic. Inputs icode, ifun, valC, valA, valB; control blocks ALU A, ALU B, ALU fun. and Set CC; the ALU produces valE; CC feeds bcond, which produces Bch.]



Control logic: ALU A

Instruction         Execute computation      Description
OPl rA, rB          valE ← valB OP valA      Perform ALU operation
rmmovl rA, D(rB)    valE ← valB + valC       Compute effective address
popl rA             valE ← valB + 4          Increment stack pointer
jXX Dest            (none)                   No operation
call Dest           valE ← valB + –4         Decrement stack pointer
ret                 valE ← valB + 4          Increment stack pointer

int aluA = [
    icode in { IRRMOVL, IOPL } : valA;
    icode in { IIRMOVL, IRMMOVL, IMRMOVL } : valC;
    icode in { ICALL, IPUSHL } : -4;
    icode in { IRET, IPOPL } : 4;
    # Other instructions don't need ALU
];
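For readers unfamiliar with HCL, the aluA selection reads as a first-match case expression. The Python below is a rough equivalent; the `Icode` enum and the function name `alu_a` are made up for this sketch, and the enum values are arbitrary rather than the real instruction encodings.

```python
from enum import Enum, auto

class Icode(Enum):
    # Symbolic instruction codes; the values are arbitrary placeholders
    INOP = auto()
    IHALT = auto()
    IRRMOVL = auto()
    IIRMOVL = auto()
    IRMMOVL = auto()
    IMRMOVL = auto()
    IOPL = auto()
    IJXX = auto()
    ICALL = auto()
    IRET = auto()
    IPUSHL = auto()
    IPOPL = auto()

def alu_a(icode, valA, valC):
    """Select ALU input A; the first matching case wins, as in HCL."""
    if icode in {Icode.IRRMOVL, Icode.IOPL}:
        return valA
    if icode in {Icode.IIRMOVL, Icode.IRMMOVL, Icode.IMRMOVL}:
        return valC
    if icode in {Icode.ICALL, Icode.IPUSHL}:
        return -4
    if icode in {Icode.IRET, Icode.IPOPL}:
        return 4
    return None  # other instructions don't need the ALU

print(alu_a(Icode.ICALL, valA=7, valC=0x20))  # prints: -4
```

In hardware all cases are evaluated in parallel by a multiplexer; the sequential `if` chain only models which input is selected.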



Hardware structure

• This can be translated into silicon

[Figure: complete SEQ hardware structure, organised in the stages Fetch, Decode, Execute, Memory and Write back. Blocks: PC, instruction memory, PC increment, register file (ports A, B, M, E; selected by srcA, srcB, dstE, dstM), ALU (ALU A, ALU B, ALU fun.), CC, Bch, data memory (Mem. control, read/write, Addr, Data, data out), New PC. Signals: icode, ifun, rA, rB, valC, valP, valA, valB, valE, valM, newPC.]



Sequential is too slow

• The clock has to be slow enough to let the signals propagate through all wires and transistors

• Critical path: the slowest path between any two storage devices

[Figure: clock signal Clk]



Pipelining

• Divide each operation into stages, and start the next operation as soon as the previous one has finished its first stage

• Increases the throughput, but also increases the latency

[Figure: three-stage pipeline. Comb. logic A (100 ps) → Reg (20 ps) → Comb. logic B (100 ps) → Reg (20 ps) → Comb. logic C (100 ps) → Reg (20 ps); all registers are driven by the same Clock.]
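The trade-off can be checked with the numbers from the figure: three blocks of combinational logic of 100 ps each, and registers of 20 ps. The helper `pipeline_timing` below is a hypothetical name; it assumes the clock period is set by the slowest stage plus one register delay.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps):
    """Return (clock period in ps, latency in ps, throughput in GIPS)."""
    clock = max(stage_delays_ps) + reg_delay_ps  # slowest stage sets the clock
    latency = clock * len(stage_delays_ps)       # one instruction, start to end
    throughput = 1000.0 / clock                  # instructions per ns = GIPS
    return clock, latency, throughput

# Unpipelined: all 300 ps of logic in one stage, followed by one register
print(pipeline_timing([300], 20))
# Three-stage pipeline: 100 ps of logic per stage
print(pipeline_timing([100, 100, 100], 20))
```

Under these assumptions the latency grows from 320 ps to 360 ps, while the throughput rises from about 3.1 to about 8.3 billion instructions per second.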



Insert registers between stages

• Pipeline registers mean extra silicon and extra delay

[Figure: pipelined hardware structure. Pipeline registers F, D, E, M and W are inserted between the stages Fetch, Decode, Execute, Memory and Write back. Signals: predPC, f_PC, valP; icode, ifun, rA, rB, valC; d_srcA, d_srcB; valA, valB; aluA, aluB; Bch, valE; M_icode, M_Bch, M_valA; Addr, Data; valM; W_icode, W_valM, W_valE, W_dstE, W_dstM.]

[Figure: pipeline diagram for cycles 1-9. Five instructions I1-I5 each pass through F D E M W, every instruction starting one cycle after its predecessor. In cycle 5 the pipeline is full: W holds I1, M holds I2, E holds I3, D holds I4, F holds I5.]
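The staggered schedule of an ideal five-stage pipeline (no stalls or bubbles) follows directly from the instruction number and the cycle number. A small sketch, with the hypothetical helper names `stage_of` and `diagram`:

```python
STAGES = "FDEMW"  # Fetch, Decode, Execute, Memory, Write back

def stage_of(instr, cycle):
    """Stage occupied by instruction `instr` in `cycle` (both 1-based), or None."""
    idx = cycle - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

def diagram(n_instr):
    """Text pipeline diagram for n_instr back-to-back instructions."""
    n_cycles = n_instr + len(STAGES) - 1
    rows = []
    for i in range(1, n_instr + 1):
        cells = [stage_of(i, c) or "." for c in range(1, n_cycles + 1)]
        rows.append(f"I{i} " + " ".join(cells))
    return "\n".join(rows)

print(diagram(5))
```

Five instructions need 5 + 5 - 1 = 9 cycles in total, and in cycle 5 every stage is occupied by a different instruction.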



Data hazards

Additional pipeline control is needed to prevent unintended interactions between instructions

• Stalling (wait a few cycles until the hazard is gone)

• Data forwarding (passing value to E before M/W)

Pipeline architecture was already used for the i386: http://www.pcmech.com/show/processors/35/



Pipeline efficiency

Pipeline control can prevent many, but not all interactions between instructions → bubbles

For the model described in the book:

• Load/use hazards (20% of load instr. → 1 bubble)

• Mispredicted branches (40% of jmp instr. → 2 bubbles)

• Return from procedure calls (100% of ret instr. → 3 bubbles)
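These penalties translate into an average number of cycles per instruction (CPI): one cycle per instruction plus the expected number of bubbles. The sketch below uses the hazard rates and bubble counts listed above; the instruction mix (25% loads, 20% conditional jumps, 2% returns) is an illustrative assumption, not a figure from the slides, and `cpi` is a hypothetical helper name.

```python
def cpi(penalties):
    """CPI = 1 + sum(instruction frequency * hazard rate * bubbles)."""
    return 1.0 + sum(freq * rate * bubbles for freq, rate, bubbles in penalties)

# (assumed instruction frequency, hazard rate, bubbles per hazard)
penalties = [
    (0.25, 0.20, 1),  # load/use hazards
    (0.20, 0.40, 2),  # mispredicted branches
    (0.02, 1.00, 3),  # returns from procedure calls
]
print(f"{cpi(penalties):.2f}")  # prints: 1.27
```

With this assumed mix, mispredicted branches contribute the largest share of the lost cycles (0.16 of the 0.27 bubbles per instruction).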



Today’s architectures

• Superscalar (Pentium): often two instructions/cycle

• Dynamic execution (P6): three instructions out-of-order/cycle

• Explicit parallelism (Itanium): six execution units



Metrics of performance

Levels of a computer system (top to bottom): Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins

Metrics at these levels:

• Answers per month

• Scaling of algorithms

• (Millions of) instructions per second – MIPS

• (Millions of) floating-point operations per second – MFLOP/s

• Megabytes per second

• Cycles per second (clock rate)

Each metric has a place and a purpose, and each can be optimized



Summary

• Shown that an instruction set architecture can be translated onto multiple processor architectures

– Complicated control logic on datapaths

– Compilers have to optimize the control logic for multiple machines/targets

– A programmer can aid/frustrate the compiler



Assignment

• Practice Problem 4.21 (page 314)

Calculate the throughput and latency of an n-stage pipeline for the given 6 blocks

A (80 ps) → B (30 ps) → C (60 ps) → D (50 ps) → E (70 ps) → F (10 ps) → Reg (20 ps)
