CS 152 Computer Architecture and Engineering
Lecture 2 - Simple Machine Implementations

John Wawrzynek
Electrical Engineering and Computer Sciences
University of California at Berkeley

http://www.eecs.berkeley.edu/~johnw
http://inst.eecs.berkeley.edu/~cs152

8/30/16 CS152, Fall 2016


Last Time in Lecture 1

§ Computer Architecture >> ISAs and RTL
– CS152 is about interaction of hardware and software, and design of appropriate abstraction layers

§ Technology and Applications shape Computer Architecture
– History provides lessons for the future

§ First 130 years of CompArch, from Babbage to IBM 360
– Move from calculators (no conditionals) to fully programmable machines
– Rapid change started in WWII (mid-1940s), move from electro-mechanical to pure electronic processors

§ Cost of software development becomes a large constraint on architecture (need compatibility)

§ IBM 360 introduces notion of "family of machines" running same ISA but very different implementations
– Six different machines released on same day (April 7, 1964)
– "Future-proofing" for subsequent generations of machine

IBM 360: Initial Implementations

                Model 30            ...   Model 70
Memory          8K - 64 KB                256K - 512 KB
Datapath        8-bit                     64-bit
Circuit Delay   30 nsec/level             5 nsec/level
Local Store     Main Store                Transistor Registers
Control Store   Read only 1 usec          Conventional circuits

IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models.

Milestone: The first true ISA designed as portable hardware-software interface!

IBM 360 Survives Today: z12 Mainframe Processor

[From IBM HotChips 24 presentation, August 28, 2012]

Special-purpose coprocessors on each core
48 MB of Level-3 cache on chip

32 nm SOI Technology
2.75 billion transistors
23.7 mm x 25.2 mm
15 layers of metal
7.68 miles of wiring!
10,000 power pins (!)
1,071 I/O pins

Instruction Set Architecture (ISA)

§ The contract between software and hardware

§ Typically described by giving all the programmer-visible state (registers + memory) plus the semantics of the instructions that operate on that state

§ IBM 360 was first line of machines to separate ISA from implementation (a.k.a. microarchitecture)

§ Many implementations possible for a given ISA
– E.g. 1: the Soviets built code-compatible clones of the IBM 360, as did Amdahl after he left IBM.
– E.g. 2: today you can buy AMD or Intel processors that run the x86-64 ISA.
– E.g. 3: many cellphones use the ARM ISA with implementations from many different companies including TI, Qualcomm, Samsung, Marvell, etc.

ISA to Microarchitecture Mapping

§ ISA often designed with particular microarchitectural style in mind, e.g.,
– Accumulator ⇒ hardwired, unpipelined
– CISC ⇒ microcoded
– RISC ⇒ hardwired, pipelined
– VLIW ⇒ fixed-latency in-order parallel pipelines
– JVM ⇒ software interpretation

§ But can be implemented with any microarchitectural style
– Intel Ivy Bridge: hardwired pipelined CISC (x86) machine (with some microcode support)
– Simics: Software-interpreted SPARC RISC machine
– ARM Jazelle: A hardware JVM processor
– This lecture: a microcoded RISC-V machine

Today, Microprogramming

§ To show how to build very small processors with complex ISAs
§ To help you understand where CISC* machines came from
§ Because it is still used in common machines (IBM 360, x86, PowerPC)
§ As a gentle introduction into machine structures
§ To help understand how technology drove the move to RISC*

* "CISC"/"RISC" names are much newer than the style of machines they refer to.

Microarchitecture: Implementation of an ISA

[Figure: Controller and Datapath blocks, linked by control points and status lines]

Structure: How components are connected. Static

Behavior: How data moves between components. Dynamic

Microcontrol Unit, Maurice Wilkes, 1954

Embed the control logic state table in a memory array

First used in EDSAC-2, completed 1958

[Figure: Wilkes microcontrol unit. Labels: Decoder, Matrix A, Matrix B, Memory, op, conditional code flip-flop, next state, µaddress, control lines to ALU, MUXs, Registers]

Microcoded Microarchitecture

[Figure: µcontroller (ROM), Datapath, and Memory (RAM). Signal labels: opcode, zero?, busy?, enMem, MemWrt, Addr, Data]

µcontroller (ROM): holds fixed microcode instructions

Memory (RAM): holds user program written in macrocode instructions (e.g., x86, RISC-V, etc.)

RISC-V ISA

§ New RISC design from UC Berkeley
§ Realistic & complete ISA, but open & small
§ Not over-architected for a certain implementation style
§ Both 32-bit and 64-bit address space variants
– RV32 and RV64
§ Designed for multiprocessing
§ Efficient instruction encoding
§ Easy to subset/extend for education/research
§ Tech. report with RISC-V spec available on class website

§ We'll be using 32-bit RISC-V this semester in lectures and labs, very similar to MIPS you saw in CS61C

RV32 Processor State

Program counter (pc)

32 x 32-bit integer registers (x0-x31)
• x0 always contains a 0

32 floating-point (FP) registers (f0-f31)
• each can contain a single- or double-precision FP value (32-bit or 64-bit IEEE FP)

FP status register (fsr), used for FP rounding mode & exception reporting
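The integer part of this state can be sketched in a few lines of Python. This is only an illustration: the class name `RV32State` and its methods are hypothetical, the FP registers and fsr are left out, and the one behavior taken from the slide is that x0 always reads as 0.

```python
class RV32State:
    """Hypothetical model of the RV32 integer state: pc plus x0-x31."""

    def __init__(self):
        self.pc = 0
        self._x = [0] * 32  # 32 integer registers, each 32 bits wide

    def read_x(self, index):
        # x0 is never written, so it always reads as 0.
        return self._x[index]

    def write_x(self, index, value):
        # Writes to x0 are discarded; other registers keep the low 32 bits.
        if index != 0:
            self._x[index] = value & 0xFFFFFFFF
```

Discarding the write is one simple way to keep x0 at zero; real hardware may instead hardwire the read port.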

RISC-V Instruction Encoding

§ Can support variable-length instructions.
§ Base instruction set (RV32) always has fixed 32-bit instructions, lowest two bits = 11₂
§ All branches and jumps have targets at 16-bit granularity (even in base ISA where all instructions are fixed 32 bits)

RISC-V Instruction Formats

[Figure: instruction format diagram with the following labeled fields]
– Destination Reg.
– Reg. Source 1
– Reg. Source 2
– Additional opcode bits/immediate
– 7-bit opcode field (but low 2 bits = 11₂)

R-Type/I-Type/R4-Type Formats

[Figure: R-Type, I-Type, and R4-Type field layouts]

R-Type: Reg-Reg ALU operations

I-Type (12-bit signed immediate): Reg-Imm ALU operations; Load instructions, (rs1 + immediate) addressing

R4-Type (adds Reg. Source 3): Only used for floating-point fused multiply-add

B-Type

[Figure: B-Type field layout]

12-bit signed immediate split across two fields

Branches, compare two registers, PC + (immediate << 1) target
(Branches do not have delay slot)

Store instructions, (rs1 + immediate) addressing, rs2 data
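The branch target arithmetic can be sketched as follows. The two field positions come from the later microprogram slide (BImm = IR[31:27,16:10]); treating IR[31:27] as the upper five bits of the immediate is an assumption, and the helper names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def b_imm(ir):
    """Assemble the 12-bit split immediate (assumed order: IR[31:27] high, IR[16:10] low)."""
    hi5 = (ir >> 27) & 0x1F
    lo7 = (ir >> 10) & 0x7F
    return sign_extend((hi5 << 7) | lo7, 12)

def branch_target(pc, ir):
    """PC + (immediate << 1), wrapped to 32 bits."""
    return (pc + (b_imm(ir) << 1)) & 0xFFFFFFFF
```

The left shift by one is what gives branch targets their 16-bit granularity.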

L-Type

[Figure: L-Type field layout]

Writes 20-bit immediate to top of destination register.

Used to build large immediates.

12-bit immediates are signed, so have to account for sign when building 32-bit immediates in 2-instruction sequence (LUI high-20b, ADDI low-12b)
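A small worked sketch of that sign adjustment, assuming LUI places its 20-bit immediate in bits 31:12 and ADDI adds a sign-extended 12-bit value. The function names are hypothetical.

```python
def split_imm32(value):
    """Split a 32-bit constant into (high20 for LUI, signed low12 for ADDI)."""
    value &= 0xFFFFFFFF
    low12 = value & 0xFFF
    # ADDI sign-extends its immediate, so a low part with bit 11 set is negative.
    low_signed = low12 - 0x1000 if low12 & 0x800 else low12
    # Compensate in the upper part: bump high20 by one when low_signed is negative.
    high20 = ((value - low_signed) >> 12) & 0xFFFFF
    return high20, low_signed

def rebuild(high20, low_signed):
    """What the LUI + ADDI pair computes, wrapped to 32 bits."""
    return ((high20 << 12) + low_signed) & 0xFFFFFFFF
```

For example, 0x12345FFF splits into high20 = 0x12346 and low12 = -1, because the ADDI subtracts one from the value LUI wrote.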

J-Type

[Figure: J-Type field layout]

"J" Unconditional jump, PC + offset target

"JAL" Jump and link, also writes PC + 4 to x1

Offset scaled by 1-bit left shift, so can jump to 16-bit instruction boundary (Same for branches)

A Bus-based Datapath for RISC-V

[Figure: bus-based datapath built around a 32-bit Bus. Components and their control signals: IR (Opcode, ldIR); Immed Select (ImmSel, enImm); register file of 32 GPRs + PC, 32-bit Reg (RegSel choosing among rs1, rs2, rd, 32 (PC), 1 (RA); RegWrt, enReg); ALU with input registers A and B (ALUOp, ldA, ldB, enALU, zero?); MA (ldMA); Memory (addr, data, MemWrt, enMem, busy)]

Microinstruction: register to register transfer (17 control signals + clock)

MA ← PC means RegSel = PC; enReg = yes; ldMA = yes

B ← Reg[rs2] means RegSel = rs2; enReg = yes; ldB = yes

Memory Module

[Figure: RAM with ports addr, din, dout, we (Write(1)/Read(0)), Enable, and busy; connected to the bus]

Assumption: Memory operates independently and is slow as compared to Reg-to-Reg transfers (multiple CPU clock cycles per access)

Instruction Execution

Execution of a RISC-V instruction involves:

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)

+ the computation of the next instruction address

Microprogram Fragments

instr fetch:  MA, A ← PC
              PC ← A + 4
              IR ← Memory
              dispatch on Opcode
              (can be treated as a macro)

ALU:          A ← Reg[rs1]
              B ← Reg[rs2]
              Reg[rd] ← func(A,B)
              do instruction fetch

ALUi:         A ← Reg[rs1]
              B ← Imm                  sign extension
              Reg[rd] ← Opcode(A,B)
              do instruction fetch
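The instruction fetch fragment can be read as three register transfers followed by a dispatch. A minimal Python sketch, assuming a dictionary for the datapath registers and a `memory` dict keyed by address (both hypothetical stand-ins), and taking the opcode as the low 7 bits of IR:

```python
def instr_fetch(state, memory):
    """Run the fetch fragment once and return the opcode to dispatch on."""
    state["MA"] = state["A"] = state["PC"]   # MA, A <- PC
    state["PC"] = state["A"] + 4             # PC <- A + 4
    state["IR"] = memory[state["MA"]]        # IR <- Memory (busy wait not modeled)
    return state["IR"] & 0x7F                # dispatch on Opcode (assumed low 7 bits)
```

After the fragment runs, A still holds the address of the fetched instruction while PC already points at the next one.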

Microprogram Fragments (cont.)

LW:        A ← Reg[rs1]
           B ← Imm
           MA ← A + B
           Reg[rd] ← Memory
           do instruction fetch

J:         A ← A - 4                Get original PC back in A
           B ← IR
           PC ← JumpTarg(A,B)
           do instruction fetch

beq:       A ← Reg[rs1]
           B ← Reg[rs2]
           If A == B then go to bz-taken
           do instruction fetch

bz-taken:  A ← PC
           A ← A - 4                Get original PC back in A
           B ← BImm << 1            BImm = IR[31:27,16:10]
           PC ← A + B
           do instruction fetch

JumpTarg(A,B) = {A + (B[31:7] << 1)}

RISC-V Microcontroller: first attempt, pure ROM implementation

[Figure: uProgram ROM. The ROM addr is formed from Opcode, zero?, Busy (memory), and the uPC (state); the ROM data supplies the Control Signals (17) and the next state. Width labels on the figure: s, s, 7]

How big is "s"?

ROM size? = 2^(opcode+status+s) words

Word size? = control + s bits

Microprogram in the ROM worksheet

State   Op   zero?  busy  Control points        next-state
ALU0    ALU  *      *     A ← Reg[rs1]          ALU1
ALU1    ALU  *      *     B ← Reg[rs2]          ALU2
ALU2    ALU  *      *     Reg[rd] ← func(A,B)   fetch0
fetch0  ALU  *      *     MA, A ← PC            fetch1
fetch1  ALU  *      yes   ....                  fetch1
fetch1  ALU  *      no    IR ← Memory           fetch2
fetch2  ALU  *      *     PC ← A + 4            ?  (next instruction sequence)

"*" denotes all combinations present

Microprogram in the ROM (cont.)

State  Op   zero?  busy  Control points        next-state
ALUi0  ALU  *      *     A ← Reg[rs1]          ALUi1
ALUi1  ALU  *      *     B ← Imm               ALUi2
ALUi2  ALU  *      *     Reg[rd] ← Op(A,B)     fetch0
...
J0     J    *      *     A ← A - 4             J1
J1     J    *      *     B ← IR                J2
J2     J    *      *     PC ← JumpTarg(A,B)    fetch0
...
beq0   beq  *      *     A ← Reg[rs1]          beq1
beq1   beq  *      *     B ← Reg[rs2]          beq2
beq2   beq  yes    *     A ← PC                beq3
beq2   beq  no     *     ....                  fetch0
beq3   beq  *      *     A ← A - 4             beq4
beq4   beq  *      *     B ← BImm              beq5
beq5   beq  *      *     PC ← A + B            fetch0
...

Size of Control Store

[Figure: Control ROM. addr = status & opcode (w bits) plus uPC (s bits); data = Control signals (c bits) plus next uPC (s bits)]

size = 2^(w+s) x (c + s)

RISC-V: w = 5+2, c = 17, s = ?

no. of steps per opcode = ~5 + fetch-sequence (3)

no. of states ≈ (8 steps per opcode) x (# of opcodes) x (4 status combos)
              = 8 x 2^5 x 4 = 1024 states
⇒ address is 10 bits, so s = (10 - 7) = 3 ⇒ width is c + s = 20 bits

Control ROM size = 1024 x 20 bits ≈ 20 Kbits
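The sizing arithmetic on this slide can be checked with a short Python sketch. The function name and parameter names are hypothetical; the formula is the slide's size = 2^(w+s) x (c+s), with s taken as the number of bits needed to count the steps per opcode.

```python
from math import ceil, log2

def control_store_size(opcode_bits, status_bits, control_signals, steps_per_opcode):
    """Return (ROM words, word width in bits, total bits) for a pure-ROM controller."""
    w = opcode_bits + status_bits            # external ROM address inputs
    s = ceil(log2(steps_per_opcode))         # uPC bits needed per opcode sequence
    words = 2 ** (w + s)                     # ROM height
    width = control_signals + s              # ROM width: control signals + next uPC
    return words, width, words * width

words, width, total_bits = control_store_size(5, 2, 17, 8)
print(words, width, total_bits)  # 1024 20 20480
```

Adding one more status input would double `words`, which is the motivation for the size reductions discussed next.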

Reducing Control Store Size

Control store has to be fast ⇒ expensive

• Reduce the ROM height (= address bits)
– reduce inputs by extra external logic
  each input bit doubles the size of the control store
– reduce states by grouping opcodes
  find common sequences of actions
– condense input status bits
  combine all exceptions into one, i.e., exception/no-exception

• Reduce the ROM width
– restrict the next-state encoding
  Next, Wait for memory, ...
– encode control signals (vertical microcode)

RISC-V Controller V2

[Figure: Control ROM addressed by uPC (state); ROM data supplies Control Signals (17) and uJumpType. Jump logic takes uJumpType, zero, and busy and produces uPCSrc, which selects the next uPC from uPC, uPC+1, absolute, and op-group (derived from Opcode through combinational logic, CL)]

uJumpType = next | spin | fetch | dispatch | ftrue | ffalse

input encoding reduces ROM height

next-state encoding reduces ROM width

Jump Logic

uPCSrc = Case uJumpTypes

  next     ⇒ uPC+1
  spin     ⇒ if (busy) then uPC else uPC+1
  fetch    ⇒ absolute
  dispatch ⇒ op-group
  ftrue    ⇒ if (zero) then absolute else uPC+1
  ffalse   ⇒ if (zero) then uPC+1 else absolute
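The case table above maps directly onto a small function. A minimal sketch in Python, where the function name `next_upc` and the use of strings for the jump types are illustrative choices, and `absolute` and `op_group` stand for the two target addresses fed into the uPC mux:

```python
def next_upc(jump_type, upc, *, busy=False, zero=False, absolute=0, op_group=0):
    """Select the next uPC from the uJumpType field and the status inputs."""
    if jump_type == "next":
        return upc + 1
    if jump_type == "spin":
        return upc if busy else upc + 1          # hold state while memory is busy
    if jump_type == "fetch":
        return absolute                          # jump to the absolute target
    if jump_type == "dispatch":
        return op_group                          # entry point chosen from the opcode
    if jump_type == "ftrue":
        return absolute if zero else upc + 1
    if jump_type == "ffalse":
        return upc + 1 if zero else absolute
    raise ValueError(f"unknown uJumpType: {jump_type}")
```

Only `spin`, `ftrue`, and `ffalse` look at status bits, which is why the ROM itself no longer needs them as address inputs.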

Instruction Fetch & ALU: RISC-V-Controller-2

State   Control points         next-state
fetch0  MA, A ← PC             next
fetch1  IR ← Memory            spin
fetch2  PC ← A + 4             dispatch
...
ALU0    A ← Reg[rs1]           next
ALU1    B ← Reg[rs2]           next
ALU2    Reg[rd] ← func(A,B)    fetch

ALUi0   A ← Reg[rs1]           next
ALUi1   B ← Imm                next
ALUi2   Reg[rd] ← Op(A,B)      fetch

Load & Store: RISC-V-Controller-2

State  Control points        next-state
LW0    A ← Reg[rs1]          next
LW1    B ← Imm               next
LW2    MA ← A + B            next
LW3    Reg[rd] ← Memory      spin
LW4                          fetch

SW0    A ← Reg[rs1]          next
SW1    B ← BImm              next
SW2    MA ← A + B            next
SW3    Memory ← Reg[rs2]     spin
SW4                          fetch

Branches: RISC-V-Controller-2

State  Control points        next-state
beq0   A ← Reg[rs1]          next
beq1   B ← Reg[rs2]          next
beq2   A ← PC                ffalse
beq3   A ← A - 4             next
beq4   B ← BImm << 1         next
beq5   PC ← A + B            fetch

Jumps: RISC-V-Controller-2

State  Control points          next-state
J0     A ← A - 4               next
J1     B ← IR                  next
J2     PC ← JumpTarg(A,B)      fetch

JR0    A ← Reg[rs1]            next
JR1    PC ← A                  fetch

JAL0   A ← PC                  next
JAL1   Reg[1] ← A              next
JAL2   A ← A - 4               next
JAL3   B ← IR                  next
JAL4   PC ← JumpTarg(A,B)      fetch

VAX 11-780 Microcode

[Figure: VAX 11-780 microcode]

Implementing Complex Instructions

[Figure: the same bus-based datapath for RISC-V shown earlier, with the same components and control signals]

rd ← M[(rs1)] op (rs2)              Reg-Memory-src ALU op
M[(rd)] ← (rs1) op (rs2)            Reg-Memory-dst ALU op
M[(rd)] ← M[(rs1)] op M[(rs2)]      Mem-Mem ALU op

Mem-Mem ALU Instructions: RISC-V-Controller-2

Mem-Mem ALU op: M[(rd)] ← M[(rs1)] op M[(rs2)]

State    Control points        next-state
ALUMM0   MA ← Reg[rs1]         next
ALUMM1   A ← Memory            spin
ALUMM2   MA ← Reg[rs2]         next
ALUMM3   B ← Memory            spin
ALUMM4   MA ← Reg[rd]          next
ALUMM5   Memory ← func(A,B)    spin
ALUMM6                         fetch

Complex instructions usually do not require datapath modifications in a microprogrammed implementation -- only extra space for the control program

Implementing these instructions using a hardwired controller might require datapath modifications

Performance Issues

Microprogrammed control ⇒ multiple cycles per instruction

Cycle time? t_C > t_alu-regfile + t_uROM

Good performance, relative to a single-cycle hardwired implementation, can be achieved:
• Total execution time (number of cycles) tailored per instruction
• Each uop fast: small ROM, simple transfers
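To make "number of cycles tailored per instruction" concrete, the next-state columns of the Controller-2 tables can be counted in Python. This is a rough sketch under stated assumptions: every microinstruction takes one cycle, and each `spin` state costs `mem_wait` extra cycles while memory is busy. The table names and `mem_wait` are hypothetical.

```python
# Next-state columns copied from the Controller-2 microcode tables.
FETCH = ["next", "spin", "dispatch"]
MICROCODE = {
    "ALU":   ["next", "next", "fetch"],
    "LW":    ["next", "next", "next", "spin", "fetch"],
    "SW":    ["next", "next", "next", "spin", "fetch"],
    "ALUMM": ["next", "spin", "next", "spin", "next", "spin", "fetch"],
}

def cycles(instr, mem_wait=0):
    """Cycles for fetch plus execute, with mem_wait extra cycles per spin state."""
    steps = FETCH + MICROCODE[instr]
    return len(steps) + mem_wait * steps.count("spin")

for name in MICROCODE:
    print(name, cycles(name), cycles(name, mem_wait=2))
```

Under these assumptions a register-register ALU instruction pays only for the fetch access, while the Mem-Mem form pays for four memory accesses.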

Horizontal vs Vertical µCode

[Figure: uCode ROM drawn as a rectangle; one axis is # µInstructions, the other is Bits per µInstruction]

§ Horizontal µcode has wider µinstructions
– Multiple parallel operations per µinstruction
– Fewer microcode steps per macroinstruction
– Sparser encoding ⇒ more bits

§ Vertical µcode has narrower µinstructions
– Typically a single datapath operation per µinstruction
– More microcode steps per macroinstruction
– More compact ⇒ fewer bits

§ Nanocoding
– Tries to combine best of horizontal and vertical µcode
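One way to see the width trade-off is to encode a group of mutually exclusive control signals. The sketch below uses the four bus-enable signals from the datapath and assumes at most one of them is active per microinstruction; that assumption, the code assignment, and the function names are illustrative and not taken from the lecture.

```python
from math import ceil, log2

BUS_DRIVERS = ["enReg", "enALU", "enMem", "enImm"]

def horizontal_bits(signals):
    """One bit per control signal."""
    return len(signals)

def vertical_bits(signals):
    """Encoded field: one code per signal plus a code for 'none active'."""
    return ceil(log2(len(signals) + 1))

def decode_vertical(code, signals):
    """Expand an encoded field back into individual enables (code 0 = none)."""
    return {name: code == i + 1 for i, name in enumerate(signals)}
```

Here the encoded field needs 3 bits instead of 4, at the cost of a decoder between the ROM and the datapath.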

Nanocoding

[Figure: uPC (state) supplies the µaddress to the µcode ROM, which outputs the µcode next-state and a nanoaddress; the nanoaddress selects the data read from the nanoinstruction ROM]

§ MC68000 had 17-bit µcode containing either 10-bit µjump or 9-bit nanoinstruction pointer
– Nanoinstructions were 68 bits wide, decoded to give 196 control signals

Exploits recurring control signal patterns in µcode, e.g.,

  ALU0   A ← Reg[rs1]
  ...
  ALUi0  A ← Reg[rs1]
  ...
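The idea of sharing recurring control patterns can be sketched as a deduplication step: unique control words go into a nano ROM and the µcode ROM stores only pointers to them. This is an illustrative Python sketch, not the MC68000 scheme; the function names are hypothetical and control words are represented as strings.

```python
def nanocode(control_words):
    """Return (ucode_rom of pointers, nano_rom of unique control words)."""
    nano_rom, index, ucode_rom = [], {}, []
    for word in control_words:
        if word not in index:            # first time this pattern is seen
            index[word] = len(nano_rom)
            nano_rom.append(word)
        ucode_rom.append(index[word])    # store a pointer instead of the word
    return ucode_rom, nano_rom

def total_bits(ucode_rom, nano_rom, word_bits):
    """Storage needed: narrow pointers plus one copy of each wide word."""
    pointer_bits = max(1, (len(nano_rom) - 1).bit_length())
    return len(ucode_rom) * pointer_bits + len(nano_rom) * word_bits

# Control points taken from the ALU, ALUi, LW, and SW microcode tables.
words = [
    "A<-Reg[rs1]", "B<-Reg[rs2]", "Reg[rd]<-func(A,B)",
    "A<-Reg[rs1]", "B<-Imm", "Reg[rd]<-Op(A,B)",
    "A<-Reg[rs1]", "B<-Imm", "MA<-A+B", "Reg[rd]<-Memory",
    "A<-Reg[rs1]", "B<-BImm", "MA<-A+B", "Memory<-Reg[rs2]",
]
ucode_rom, nano_rom = nanocode(words)
```

The saving grows with the number of repeated patterns and the width of each control word; with very little repetition the pointer overhead can outweigh it.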

Microprogramming in IBM 360

                          M30     M40     M50     M65
Datapath width (bits)     8       16      32      64
µinst width (bits)        50      52      85      87
µcode size (K µinsts)     4       4       2.75    2.75
µstore technology         CCROS   TCROS   BCROS   BCROS
µstore cycle (ns)         750     625     500     200
memory cycle (ns)         1500    2500    2000    750
Rental fee ($K/month)     4       7       15      35

Only the fastest models (75 and 95) were hardwired

IBM Card Capacitor Read-Only Storage

[Figure: punched card with metal film and fixed sensing plates] [IBM Journal, January 1961]

Microprogramming thrived in the Seventies

§ Significantly faster ROMs than DRAMs were available
§ For complex instruction sets, datapath and controller were cheaper and simpler
§ New instructions, e.g., floating point, could be supported without datapath modifications
§ Fixing bugs in the controller was easier
§ ISA compatibility across various models could be achieved easily and cheaply

Except for the cheapest and fastest machines, all computers were microprogrammed

Writable Control Store (WCS)

§ Implement control store in RAM not ROM
– MOS SRAM memories now almost as fast as control store (core memories/DRAMs were 2-10x slower)
– Bug-free microprograms difficult to write

§ User-WCS provided as option on several minicomputers
– Allowed users to change microcode for each processor

§ User-WCS failed
– Little or no programming tools support
– Difficult to fit software into small space
– Microcode control tailored to original ISA, less useful for others
– Large WCS part of processor state - expensive context switches
– Protection difficult if user can change microcode
– Virtual memory required restartable microcode

Microprogramming is far from extinct

§ Played a crucial role in micros of the Eighties
• DEC uVAX, Motorola 68K series, Intel 286/386

§ Plays an assisting role in most modern micros
– e.g., AMD Bulldozer, Intel Ivy Bridge, Intel Atom, IBM PowerPC, …
– Most instructions executed directly, i.e., with hard-wired control
– Infrequently-used and/or complicated instructions invoke microcode

§ Patchable microcode common for post-fabrication bug fixes, e.g. Intel processors load µcode patches at bootup

Acknowledgements

§ These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

§ MIT material derived from course 6.823
§ UCB material derived from course CS252