L02-SimpleImps

8/6/2019 L02-SimpleImps

http://slidepdf.com/reader/full/l02-simpleimps 1/55

CS 152 Computer Architecture andEngineering

Lecture 2 - Simple MachineImplementations

Krste AsanovicElectrical Engineering and Computer Sciences

University of California at Berkeley

http://www.eecs.berkeley.edu/~krstehttp://inst.eecs.berkeley.edu/~cs152



1/27/2009 CS152-Spring¶09 2

Last Time in Lecture 1Computer Science at crossroads from sequential toparallel computingComputer Architecture >> ISAs and RTL ± CS152 is about interaction of hardware and software, and design of

appropriate abstraction layers

Comp. Arch. shaped by technology and applications ± History provides lessons for the future

Cost of software development a large constraint onarchitecture ± Compatibility a key solution to software cost

IBM 360 introduces notion of ³family of machines´running same ISA but very different implementations ± Within same generation of machines ± ³Future-proofing´ for subsequent generations of machine



1/27/2009 CS152-Spring¶09 3

B urrough¶s B 5000 Stack Architecture:An ALGOL Machine, Robert B arton, 1960

Machine implementation can be completely hidden if theprogrammer is provided only a high-level language interface.

Stack machine organization because stacks are convenient for:1. expression evaluation;2. subroutine calls, recursion, nested interrupts;3. accessing variables in block-structured languages.

B6700, a later model, had many more innovative features ± tagged data ± virtual memory ± multiple processors and memories



1/27/2009 CS152-Spring¶09 4

A Stack Machine

A Stack machine has a stack asa part of the processor state

typical operations:push, pop, +, *, ...

Instructions like + implicitlyspecify the top 2 elements of the stack as operands.

a

:

stack

Processor

MainStore

ba

push bcba

push c ba

pop



1/27/2009 CS152-Spring¶09 5

abc

Evaluation of Expressions

(a + b * c) / (a + d * c - e) /

+

* +a e-

ac

dc

*b

Reverse Polisha b c * + a d c * + e - /

push apush bpush cmultiply

*

Evaluation Stack

b * c



1/27/2009 CS152-Spring¶09 6

a

Evaluation of Expressions

(a + b * c) / (a + d * c - e) /

+

* +a e-

ac

dc

*b

Reverse Polisha b c * + a d c * + e - /

add

+

Evaluation Stack

b * c

a + b * c



1/27/2009 CS152-Spring¶09 7

H ardware organization of the stack

Stack is part of the processor state s tack mu s t be bounded and s mall

} number of Registers,not the size of main memory

Conceptually stack is unbounded a part of the s tack i s included in the

proce ss or s tate; the re s t i s kept in themain memory



1/27/2009 CS152-Spring¶09 10

Stack Size and Expression Evaluation

program stack (size = 4 )push a R0push b R0 R1

push c R0 R1 R2* R0 R1+ R0push a R0 R1push d R0 R1 R2push c R0 R1 R2 R3* R0 R1 R2+ R0 R1push e R0 R1 R2- R0 R1

/ R0

a b c * + a d c * + e - /

a and c are³loaded´ twice

not the best use of registers!



1/27/2009 CS152-Spring¶09 11

Register Usage in a GPR Machine

M ore control over register usagesince registers can be named explicitly

Load Ri mLoad Ri (Rj)Load Ri (Rj) (Rk)

- eliminates unnecessary Loads and Stores

- fewer Registers

but instructions may be longer!

Load R0 aLoad R1 cLoad R2 b

Mul R2 R1

(a + b * c) / (a + d * c - e)

ReuseR2

Add R2 R0Load R3 dMul R3 R1Add R3 R0

ReuseR3

Load R0 eSub R3 R0Div R2 R3

ReuseR0



1/27/2009 CS152-Spring¶09 12

Stack Machines: Essential features

In addition to push, pop, +etc., the instruction setmust provide thecapability to ± refer to any element in the

data area ± jump to any in s truction in

the code area ± move any element in the

s tack frame to the top

machinery tocarry out+, -, etc.

stackSP

DP

PC

data

.

.

.

abc

push apush bpush c

*+push e

/

code



1/27/2009 CS152-Spring¶09 13

Stack versus GPR Organization Amdahl, Blaauw and Brooks, 1964

1. The performance advantage of push down stack organizationis derived from the presence of fast registers and not the waythey are used.

2.³Surfacing´ of data in stack which are ³profitable´ isapproximately 50% because of constants and common

subexpressions.3. Advantage of instruction density because of implicit addressesis equaled if short addresses to specify registers are allowed.

4. Management of finite depth stack causes complexity.5. Recursive subroutine advantage can be realized only with the

help of an independent stack for addressing.6. Fitting variable-length fields into fixed-width word is awkward.



1/27/2009 CS152-Spring¶09 14

1. Stack programs are not smaller if short (Register)addresses are permitted.

2. Modern compilers can manage fast register space better than the stack discipline.

Stack Machines (Mostly) Died by 1980

P R¶s and caches are better than stack and displays

Early language - directed architectures often did not take into account the role of compilers!

B5 000, B67 00, HP 3000, IC L 2900, Symbolics 3 6 00

Some would claim that an echo of this mistake isvisible in the SPARC architecture register windows -

more later«



1/27/2009 CS152-Spring¶09 17

ISA to Microarchitecture Mapping

ISA often designed with particular microarchitectural stylein mind, e.g., ± CISC microcoded ± RISC hardwired, pipelined ± VLIW fixed-latency in-order pipelines ± J VM software interpretation

But can be implemented with any microarchitectural style ± Core 2 Duo: hardwired pipelined CISC (x86)

machine (with some microcode support)

± This lecture: a microcoded RISC (MIPS) machine ± Intel could implement a dynamically scheduled out-of-order VLIW (IA-64) processor

± ARM J azelle: A hardware J VM processor ± Simics: Software-interpreted SPARC RISC machine



1/27/2009 CS152-Spring¶09 18

Microarchitecture: I mplementation of an IS A

Structure: How components are connected.Static

Behavior: How data moves between components

Dynamic

Controller

Datapath

controlpointsstatus

lines



1/27/2009 CS152-Spring¶09 19

Microcontrol Unit M aurice Wilkes, 1954

Embed the control logic state table in a memory array

Matrix A Matrix B

Decoder

Next state

op conditionalcode flip-flop

Q address

Control lines toALU , M UX s, Registers



1/27/2009 CS152-Spring¶09 20

Microcoded Microarchitecture

Memory(RAM)

Datapath

Qcontroller(ROM)

Addr Data

zero? busy?

opcode

en M emM emWrt

holds fixedmicrocode instructions

holds user programwritten in macrocode

instructions (e.g.,MIPS, x8 6 , etc.)



1/27/2009 CS152-Spring¶09 21

The MIPS32 ISA

Processor State32 32-bit GPRs, R0 always contains a 01 6 double-precision/32 single-precision FPRsFP status register, used for FP compares & exceptionsPC, the program countersome other special registers

Data types8-bit byte, 1 6 -bit half word32-bit word for integers32-bit word for single precision floating point64 -bit word for double precision floating point

Load/Store style instruction setdata addressing modes- immediate & indexedbranch addressing modes- PC relative & register indirectB yte addressable memory- big-endian mode

All instructions are 32 bits

See H& P

A ppendix B for full description



1/27/2009 CS152-Spring¶09 22

MIPS Instruction Formats

6 5 5 1 6

opcode rs offset B EQZ, B NEZ

6 2 6

opcode offset J, JA L

6 5 5 1 6opcode rs JR, JA LR

opcode rs rt immediate rt n (rs) op immediate

6 5 5 5 5 6

0 rs rt rd 0 func rd n (rs) func (rt)ALU

ALU i

6 5 5 1 6

opcode rs rt displacement M[(rs) + displacement]Mem



1/27/2009 CS152-Spring¶09 23

A B us-based Datapath for MIPS

M icroinstruction: register to register transfer (17 control signals)MA n PC means RegSel = PC; enReg=yes; ldMA= yes

B n Reg[rt] means

enMem

MA

addr

data

ldMA

Memory

busy

MemWrt

B us 32

zero?

A B

OpSel ldA ld B

ALU

enA LU

ALU

control

2

RegWrtenReg

addr

data

rsrtrd32(PC)31( L ink)

RegSel

32 GPRs+ PC ...

32-bit Reg

3rsrtrd

ExtSel

IR

Opcode

ldIR

ImmExt

enImm

2

RegSel = rt; enReg=yes; ld B = yes



1/27/2009 CS152-Spring¶09 24

Memory Module

Assumption: Memory operates independentlyand is slow as compared to Reg-to-Reg transfers

(multiple CP U clock cycles per access)

EnableWrite(1)/Read(0)RAM

din dout

we

addr busy

bus



1/27/2009 CS152-Spring¶09 25

Instruction Execution

Execution of a MIPS instruction involves

1. instruction fetch2. decode and register fetch3. A LU operation4 . memory operation (optional)5 . write back to register file (optional)

+ the computation of the

next instruction address



1/27/2009 CS152-Spring¶09 26

Microprogram Fragments

instr fetch: MA n PC A n PCIR n MemoryPC n A + 4dispatch on OPcode

can betreated a s

a macro

ALU: An Reg[rs]B n Reg[rt]Reg[rd] n func(A,B)do instruction fetch

ALUi: An Reg[rs]B n Imm s ign exten s ion ...Reg[rt] n Opcode(A,B)do instruction fetch



1/27/2009 CS152-Spring¶09 27

Microprogram Fragments ( cont.)

LW: A n Reg[rs]B n ImmMA n A + B

Reg[rt] n Memorydo instruction fetch

J: A n PCB n IRPC n JumpTarg(A, B )do instruction fetch

beqz: A n Reg[rs]

I f zero?(A) then go to bz-takendo instruction fetch

bz-taken: A n PCB n Imm << 2PC n A + B

do instruction fetch

JumpTarg( A ,B) ={ A[ 31:28],B [ 25:0],00}



1/27/2009 CS152-Spring¶09 28

MIPS Microcontroller: first attempt

nextstate

QPC (state)

Opcodezero?

B usy (memory)

Control Signals (1 7 )

ss

6

QProgram ROM

addr

data

latching the inputsmay cause aone - cycle delay

= 2 (opcode+status+s) words

How bigis ³s´?

RO M size ?

Word size ? = control+s bits



1/27/2009 CS152-Spring¶09 29

Microprogram in the ROM worksheet

State Op zero? busy Control points next-state

fetch 0 * * * MA n PC fetch 1fetch 1 * * yes .... fetch 1fetch 1 * * no IR n Memory fetch 2

fetch 2 * * * A n PC fetch 3fetch 3 * * * PC n A + 4 ?

ALU 0 * * * A n Reg[rs] A LU 1ALU 1 * * * B n Reg[rt] A LU 2ALU 2 * * * Reg[rd] n func(A, B ) fetch 0

fetch 3 ALU * * PC n A + 4 ALU 0



1/27/2009 CS152-Spring¶09 30

Microprogram in the ROM

State Op zero? busy Control points next-statefetch 0 * * * MA n PC fetch 1fetch 1 * * yes .... fetch 1fetch 1 * * no IR n Memory fetch 2fetch 2 * * * A n PC fetch 3fetch

3ALU * * PC n A + 4 ALU

0fetch 3 ALU i * * PC n A + 4 ALU i0fetch 3 LW * * PC n A + 4 LW0fetch 3 SW * * PC n A + 4 SW 0fetch 3 J * * PC n A + 4 J0fetch 3 JAL * * PC n A + 4 JAL 0

fetch 3 JR * * PC n A +4

JR 0fetch 3 JALR * * PC n A + 4 JALR0fetch 3 beqz * * PC n A + 4 beqz 0...

ALU 0 * * * A n Reg[rs] A LU 1ALU 1 * * * B n Reg[rt] A LU 2

ALU 2 * * * Reg[rd] n func(A, B ) fetch 0



1/27/2009 CS152-Spring¶09 31

Microprogram in the ROM C ont.

State Op zero? busy Control points next-stateALU i0 * * * A n Reg[rs] A LU i1ALU i1 sExt * * B n sExt 1 6 (Imm) A LU i2ALU i1 uExt * * B n uExt 1 6 (Imm) A LU i2ALU i2 * * * Reg[rd] n Op(A, B ) fetch 0

...J0 * * * A n PC J 1J1 * * * B n IR J 2J2 * * * PC n JumpTarg(A, B ) fetch 0...

beqz0

* * * A n Reg[rs] beqz1beqz 1 * yes * A n PC beqz 2

beqz 1 * no * .... fetch 0beqz 2 * * * B n sExt 1 6 (Imm) beqz 3beqz 3 * * * PC n A+ B fetch 0...

JumpTarg( A ,B) = { A[ 31:28],B [ 25:0],00}



1/27/2009 CS152-Spring¶09 32

Size of Control Store

M IPS: w = 6+2 c = 17 s = ?no. of steps per opcode = 4 to 6 + fetch-sequenceno. of states } (4 steps per op-group ) x op-groups

+ common sequences= 4 x 8 + 10 states = 42 states s = 6

Control ROM = 2 (8+6) x 23 bits } 48 Kbytes

size = 2 (w+s) x (c + s) Control ROM

data

status & opcode

addr

next QPC

Control signals

QPC / w

/ s

/ c



1/27/2009 CS152-Spring¶09 33

Reducing Control Store Size

Reduce the ROM height (= address bits)± reduce inputs by extra external logic

each input bit doubles the size of the

control store± reduce states by grouping opcodes

find common sequences of actions± condense input status bits

combine all exceptions into one, i.e.,exception/no-exception

Reduce the ROM width± restrict the next - state encoding

Next, Dispatch on opcode, Wait for memory, ...± encode control signals (vertical microcode)

Control store has to be fast expensive



1/27/2009 CS152-Spring¶09 34

CS152 Administrivia

Room change/decision: ± Either Soda 306 (here) for all but two lectures,

back in 320 for those ± Or 182 Dwinelle for all lectures

± Vote ?Class schedule almost up to date (but might need tochange quiz depending on vote above)Lab 1 coming out on Thursday, together with PS1



1/27/2009 CS152-Spring¶09 35

Problem Sets

Designed to help you learn material and practice for quizMust hand in day before section that reviewsproblem set, which will be shortly before quizSolutions handed out in review section

Quiz assumes you have worked through problem set.Problem sets graded on 0,1,2 scale



1/27/2009 CS152-Spring¶09 36

Collaboration Policy

Can collaborate to understand problem sets, butmust turn in own solution. Some problems repeatedfrom earlier years - do not copy solutions. Q uizproblems will not be repeated«Each student must complete directed portion of thelab by themselves. OK to collaborate to understandhow to run labs ± Class news group info on web site.

Can work in group of up to 3 students for open-ended portion of each lab ± OK to be in different group for each lab -just make sure to label

participants¶ names clearly on each turned in lab section



1/27/2009 CS152-Spring¶09 37

MIPS Controller V2

QJumpType =next | spin

| fetch | dispatch| feqz | fnez

Control Signals (1 7 )

Control ROM

address

data

+1

Opcode ext

QPC (state)

jumplogic

zero

QPC QPC+1

absolute

op-group

busy

QPCSrcinput encodingreduces RO M height

next - state encodingreduces RO M width



1/27/2009 CS152-Spring¶09 38

J ump Logic

QPCSrc = C ase QJumpTypes

next QPC+1

spin if (busy) then QPC else QPC+1

fetch absolute

dispatch op-group

feqz if (zero) then absolute else QPC+1

fnez if (zero) then QPC+1 else absolute



1/27/2009 CS152-Spring¶09 39

Instruction Fetch & ALU: M IPS- C ontroller -2

State Control points next-state

fetch 0 MA n PCfetch 1 IR n Memoryfetch 2 A n PC

fetch 3 PC n A + 4 ...ALU 0 A n Reg[rs]ALU 1 B n Reg[rt]ALU 2 Reg[rd] n func(A, B )

ALU i0 A n Reg[rs]ALU i1 B n sExt 1 6 (Imm)ALU i2 Reg[rd] n Op(A, B )

nextspinnext

dispatch

nextnextfetch

nextnextfetch



1/27/2009 CS152-Spring¶09 40

Load & Store: M IPS- C ontroller -2


LW0 A n Reg[rs] nextLW1 B n sExt 1 6 (Imm) nextLW2 MA n A+ B nextLW3 Reg[rt] n Memory spinLW4 fetch

SW 0 A n Reg[rs] nextSW 1 B n sExt 1 6 (Imm) next

SW 2 MA n A+B

nextSW 3 Memory n Reg[rt] spinSW 4 fetch



1/27/2009 CS152-Spring¶09 41

B ranches: M IPS- C ontroller -2


B EQZ 0 A n Reg[rs] nextB EQZ 1 fnezB EQZ 2 A n PC nextB EQZ 3 B n sExt 1 6 (Imm<<2) nextB EQZ 4 PC n A+ B fetch

B NEZ 0 A n Reg[rs] nextB NEZ 1 feqzB NEZ

2A n PC next

B NEZ 3 B n sExt 1 6 (Imm<<2) nextB NEZ 4 PC n A+ B fetch



1/27/2009 CS152-Spring¶09 42

J umps: M IPS- C ontroller -2


J0 A n PC nextJ1 B n IR nextJ2 PC n JumpTarg(A, B ) fetch

JR 0 A n Reg[rs] nextJR 1 PC n A fetch

JAL 0 A n PC nextJAL 1 Reg[31] n A nextJAL 2 B n IR next

JAL

3 PC n JumpTarg(A,B

) fetchJALR0 A n PC nextJALR1 B n Reg[rs] nextJALR2 Reg[31] n A nextJALR3 PC n B fetch



1/27/2009 CS152-Spring¶09 43

VAX 11-780 Microcode



1/27/2009 CS152-Spring¶09 45

Mem-Mem ALU Instructions:M IPS- C ontroller -2

M em -M em ALU op M[(rd)] n M[(rs)] op M[(rt)]

ALU MM0 MA n Reg[rs] nextALU MM1 A n Memory spinALU MM2 MA n Reg[rt] next

ALU

MM3B

n Memory spinALU MM4 MA n Reg[rd] nextALU MM5 Memory n func(A, B ) spinALU MM6 fetch

Complex instructions usually do not require datapathmodifications in a microprogrammed implementation

-- only extra space for the control program

Implementing these instructions using a hardwiredcontroller is difficult without datapath modifications



1/27/2009 CS152-Spring¶09 46

Performance Issues

Microprogrammed controlmultiple cycles per instruction

Cycle time ?t C > max(t reg-reg , t ALU , t QROM)

Suppose 10 * t QROM < t RAM

G ood performance, relative to a single - cyclehardwired implementation, can be achieved even with a C P I of 10



1/27/2009 CS152-Spring¶09 47

H orizontal vs Vertical QCode

Horizontal Qcode has wider Qinstructions ± Multiple parallel operations per Qinstruction ± Fewer steps per macroinstruction ± Sparser encoding more bits

Vertical Qcode has narrower Qinstructions ± Typically a single datapath operation per Qinstruction

± separate Qinstruction for branches ± More steps to per macroinstruction ± More compact less bits

Nanocoding ± Tries to combine best of horizontal and vertical Qcode

# QInstructions

B its per QInstruction



1/27/2009 CS152-Spring¶09 48

N anocoding

MC68000 had 17-bit Qcode containing either 10-bit Q jump or 9-bit nanoinstruction pointer ± Nanoinstructions were 68 bits wide, decoded to give 196

control signals

Qcode ROM

nanoaddress

Qcodenext-state

Qaddress

QPC (state)

nanoinstruction ROMdata

Exploits recurringcontrol signal patternsin Qcode, e.g.,

ALU

0 A n Reg[rs]...ALU i0 A n Reg[rs]...



1/27/2009 CS152-Spring¶09 49

Microprogramming in I B M 360

O nly the fa s te s t model s (75 and 95) were hardwired

M30 M40 M50 M65Datapath width(bits)

8 16 32 64

Qinst width(bits)

50 52 85 87

Qcode size(K µinsts) 4 4 2.75 2.75Qstoretechnology

CCROS TCROS BCROS BCROS

Qstore cycle(ns)

750 625 500 200

memory cycle(ns)

1500 2500 2000 750

Rental fee($K/month)

4 7 15 35



1/27/2009 CS152-Spring¶09 50

Microcode Emulation

IBM initially miscalculated the importance of softwarecompatibility with earlier models when introducing the360 seriesHoneywell stole some IBM 1401 customers by offeringtranslation software (³Liberator´) for Honeywell H200series machineIBM retaliated with optional additional microcode for 360series that could emulate IBM 1401 ISA, later extendedfor IBM 7000 series

± one popular program on 1401 was a 650 simulator, so somecustomers ran many 650 programs on emulated 1401s ± (6 50 s imulated on 1401 emulated on 3 6 0)



1/27/2009 CS152-Spring¶09 51

Microprogramming thrived in theSeventies

Significantly faster ROMs than DRAMs were availableFor complex instruction sets, datapath and controller were cheaper and s impler N ew in s truction s , e.g., floating point, could be supported

without datapath modificationsF ixing bug s in the controller was easier ISA compatibility across various models could beachieved easily and cheaply

Except for the cheapest and fastest machines,all computers were microprogrammed



1/27/2009 CS152-Spring¶09 52

W ritable Control Store ( W CS)

Implement control store in RAM not ROM ± MOS SRAM memories now almost as fast as control store (corememories/DRAMs were 2-10x slower)

± Bug-free microprograms difficult to write

User-WCS provided as option on several minicomputers ± Allowed users to change microcode for each processor

User-WCS failed ± Little or no programming tools support ± Difficult to fit software into small space

± Microcode control tailored to original ISA, less useful for others ± Large WCS part of processor state - expensive context switches ± Protection difficult if user can change microcode ± Virtual memory required re s tartable microcode



1/27/2009 CS152-Spring¶09 53

Microprogramming: early EightiesEvolution bred more complex micro-machines

± Complex instruction sets led to need for subroutine and call stacks in µcode ± Need for fixing bugs in control programs was in conflict with read-only nature

of µROM ± --> WCS (B1700, Q Machine, Intel i432, «)

With the advent of VLSI technology assumptions about ROM &

RAM speed became invalid -> more complexityBetter compilers made complex instructions less important.

Use of numerous micro-architectural innovations, e.g.,pipelining, caches and buffers, made multiple-cycle execution of

reg-reg instructions unattractive

Looking ahead to RISC next time ± Use chip area to build fast instruction cache of user-visible vertical

microinstructions - use software subroutine not hardware microroutines ± Use simple ISA to enable hardwired pipelined implementation



1/27/2009 CS152-Spring¶09 54

Modern Usage

M icroprogramming is far from extinct

Played a crucial role in micros of the EightiesDE C uV AX , M otorola 68K series, I ntel 386 and 4 86

Microcode pays an assisting role in most modernmicros ( AM D Opteron, I ntel C ore 2 Duo, I ntel Atom, I BM P ower P C )Most instructions are executed directly, i.e., with hard-wiredcontrolInfrequently-used and/or complicated instructions invoke themicrocode engine

P atchable microcode common for post-fabricationbug fixes, e.g. Intel processors load µcode patchesat bootup



Acknowledgements

These slides contain material developed andcopyright by: ± Arvind (MIT) ± Krste Asanovic (MIT/UCB) ± J oel Emer (Intel/MIT)

±J

ames Hoe (CMU) ± J ohn Kubiatowicz (UCB) ± David Patterson (UCB)

MIT material derived from course 6.823

UCB material derived from course CS252

L02-SimpleImps

Documents

Transcript of L02-SimpleImps