lecture04 ISA Principles - GitHub Pages · 1 2 Register-Memory IBM 360/370, Intel 80x86, Motorola...

Lecture04:ISAPrinciples

CSE564ComputerArchitectureSummer2017

DepartmentofComputerScienceandEngineering

YonghongYan

[email protected]

www.secs.oakland.edu/~yan

1

Contents

1.   Introduc@on2.   ClassifyingInstruc@onSetArchitectures3.   MemoryAddressing4.   TypeandSizeofOperands5.   Opera@onsintheInstruc@onSet6.   Instruc@onsforControlFlow7.   EncodinganInstruc@onSet8.   CrosscuMngIssues:TheRoleofCompilers9.   RISC-VISA•  Supplement

–  MIPSISA

–  RISCvsCISC–  CompilercompilaFonstages

–  ISAHistorical•  AppendixL

–  ComparisonofISA

•  AppendixK2

1Introduc@on

Instruc@onSetArchitecture–theporFonofthemachine

visibletotheassemblylevelprogrammerortothe

compilerwriter

–  Tousethehardwareofacomputer,wemustspeakits

language

–  Thewordsofacomputerlanguagearecalledinstruc,ons,and

itsvocabularyiscalledaninstruc,onset

3

instruc@onset

soRware

hardware

Instr.# OperaFon+Operands

i movl-4(%ebp),%eax

(i+1)addl%eax,(%edx)

(i+2)cmpl8(%ebp),%eax

(i+3)jl L5

:

L5:

4

sum.sforX86

•  h]p://www.hep.wisc.edu/~pinghc/

x86AssmTutorial.htm

•  h]ps://en.wikibooks.org/wiki/X86_Assembly/SSE

5

sum.sforRISC-V

h]ps://riscv.org/

2ClassifyingInstruc@onSetArchitectures

6

OperandstorageinCPU Wherearetheyotherthanmemory

#explicitoperandsnamedperinstruc@on

Howmany?Min,Max,Average

Addressingmode HowtheeffecFveaddressforan

operandcalculated?Canalluseany

mode?

Opera@ons WhataretheopFonsfortheopcode?

Type&sizeofoperands Howistypingdone?Howisthesize

specified?

Thesechoicescri,callyaffectnumberofinstruc,ons,CPI,and

CPUcycle,me

ISAClassifica@on

•  MostbasicdifferenFaFon:internalstorageinaprocessor

–  Operandsmaybenamedexplicitlyorimplicitly

•  Majorchoices:

1.  Inanaccumulatorarchitectureoneoperandisimplicitlythe

accumulator=>similartocalculator

2.  Theoperandsinastackarchitectureareimplicitlyonthe

topofthestack

3.  Thegeneral-purposeregisterarchitectureshaveonlyexplicitoperands–eitherregistersormemorylocaFon

7

RegisterMachines

• Howmanyregistersaresufficient?

• General-purposeregistersvs.special-purposeregisters•  compilerflexibilityandhand-op,miza,on

•  TwomajorconcernsforarithmeFcandlogicalinstrucFons(ALU)

1.Twoorthreeoperands

•  X+Y⇒X•  X+Y⇒Z

2.Howmanyoftheoperandsmaybememoryaddresses(0–3)

8Hence,registerarchitectureclassifica@on(#mem,#operands)

Number of memory addresses

Maximum number of operands

allowed

Type of Architecture

Examples

0 3 Load-Store Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, TM32

1 2 Register-Memory IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x

2 2 Memory – memory VAX (also has 3 operand formats)

3 3 Memory - memory VAX (also has 2 operand formats)

(0,3):Register-Register

•  ALUisRegistertoRegister–alsoknownaspureReducedInstruc-onSetComputer(RISC)

o Advantages–  simplefixedlengthinstrucFonencoding

–  decodeissimplesinceinstrucFontypesaresmall

–  simplecodegeneraFonmodel

–  instrucFonCPItendstobeveryuniform

–  exceptformemoryinstrucFonsofcourse

–  butthereareonly2ofthem-loadandstore

o Disadvantages–  instrucFoncounttendstobehigher

–  someinstrucFonsareshort-wasFnginstrucFonwordbits

9

(1,2):Register-Memory

10

EvolvedRISCandalsooldCISC

•newRISCmachinescapableofdoingspecula,veloads

•predicatedand/ordeferredloadsarealsopossible

o Advantages–  dataaccesstoALUimmediatewithoutloadingfirst

–  instrucFonformatisrelaFvelysimpletoencode

–  codedensityisimprovedoverRegister(0,3)model

o Disadvantages–  operandsarenotequivalent-sourceoperandmaybe

destroyed

–  needformemoryaddressfieldmaylimit#ofregisters

–  CPIwillvary•ifmemorytargetisinL0cachethennotsobad

•ifnot-lifegetsmiserable

(2,2)or(3,3):Memory-Memory

11

TrueandmostcomplexCISCmodel

•currentlyexFnctandlikelytoremainso

•morecomplexmemoryacFonsarelikelytoappearbutnot

directlylinkedtotheALU

o Advantages–  mostcompactcode

–  doesn’twasteregistersfortemporaryvalues

• goodideaforuseoncedata-e.g.streamingmedia

o Disadvantages–  largevariaFonininstrucFonsize-mayneedashoe-horn

–  largevariaFoninCPI-i.e.workperinstrucFon

–  exacerbatestheinfamousmemorybo]leneck

• registerfilereducesmemoryaccessesifreused

Notusedtoday

Summary:TradeoffsfortheISAClasses

12

Type Advantages Disadvantages

Register-register (0,3)

Simple, fixed length instruction encoding. Simple code generation model. Instructions take similar numbers of clocks to execute.

Higher instruction count than architectures with memory references in the instructions. More instructions and lower instruction density leads to larger programs

Register-memory (1,2)

Data can be accessed without a separate load instruction first. Instruction format tends to be easy to encode and yields good density

Operands are not equivalent since a source operand is destroyed. Encoding a register number and a memory address in each instruction may restrict the number of registers. Clocks per instruction vary by operand location

Memory-memory (2,2) or (3,3)

Most compact. Does not waste registers for temporaries.

Large variation in instruction size, especially for three-operand instructions. In addition, large variation in work per instruction. Memory accesses create memory bottleneck. (Not used today)

3MemoryAddressing

• Objectshavebyteaddresses– thenumberofbytescountedfromthebeginningofmemory

• ObjectLength:– bytes(8bits),halfwords(16bits),– words(32bits),anddoublewords(64bits).– Thetypeisimpliedinopcode,e.g.,

•  LDB–loadbyte•  LDW–loadword,etc

• ByteOrdering– Li]leEndian:putsthebytewhoseaddressisxx00attheleastsignificantposiFonintheword.(7,6,5,4,3,2,1,0)

– BigEndian:putsthebytewhoseaddressisxx00atthemostsignificantposiFonin

theword.(0,1,2,3,4,5,6,7)

•  Problemoccurswhenexchangingdataamongmachineswithdifferent

orderings

13

Interpre@ngMemoryAddresses

•  AlignmentIssues

–  Accessestoobjectslargerthanabytemustbealigned.Anaccessto

anobjectofsizesbytesatbyteaddressAisalignedifAmods=0.

–  MisalignmentcauseshardwarecomplicaFons,sincethememoryis

typicallyalignedonawordoradouble-wordboundary

–  Misalignmenttypicallyresultsinanalignmentfaultthatmustbe

handledbytheOS

•  Hence–  byteaddressisanything-nevermisaligned

–  halfword-evenaddresses-loworderaddressbit=0(XXXXXXX0)elsetrap

–  word-loworder2addressbits=0(XXXXXX00)elsetrap–  doubleword-loworder3addressbits=0(XXXXX000)elsetrap

14

MemoryAlignment

15

16

Aligned/MisalignedAddresses

17

AddressingModes

•  Howarchitecturespecifytheeffec@veaddressofanobject?

–  Effec-veaddress:theactualmemoryaddressspecifiedbythe

addressingmode.

•  E.g.Mem[R[R1]]referstothecontentsofthememory

loca,onwhoseloca,onisgiventhecontentsofregister1

(R1).

•  AddressingModes:–  Register.–  Immediate

–  Displacement

–  Registerindirect,……..

18

AddressModes

AddressingModeImpacts

•  InstrucFoncounts•  ArchitectureComplexity

•  CPI

19

20

Summaryofuseofmemoryaddressingmodes

21

Displacementvaluesarewidelydistributed

ImpactinstrucFonlength

DisplacementAddressingMode

•  Benchmarksshow–  12bitsofdisplacementwouldcaptureabout75%ofthefull32-bit

displacements–  16bitsshouldcaptureabout99%

•  Remember:–  [email protected],thechoiceisatleast12-16bits

•  Foraddressesthatdofitindisplacementsize:

Add R4,10000(R0)

•  Foraddressesthatdon’tfitindisplacementsize,thecompilermustdo

thefollowing:

Load R1,1000000

AddR1,R0

Add R4,0(R1)

22

ImmediateAddressingMode

•  Usedwherewewanttogettoanumericalvalueinan

instrucFon

•  Around25%oftheopera@onshaveanimmediateoperand

23

Athighlevel:a=b+3;if(a>17)gotoAddr

AtAssemblerlevel:LoadR2,#3AddR0,R1,R2LoadR2,#17CMPBGTR1,R2LoadR1,AddressJump(R1)

About25%ofdatatransferandALUopera@onshaveanimmediateoperand

24


Numberofbitsforimmediate

•  16bitswouldcaptureabout80%and8bitsabout50%.

25


Summary:MemoryAddressing

•  Anewarchitectureexpectedtosupportatleast:displacement,immediate,andregisterindirect–  represent75%to99%oftheaddressingmodes

•  Thesizeoftheaddressfordisplacementmodetobeat

least12-16bits–  capture75%to99%ofthedisplacements

•  Thesizeoftheimmediatefieldtobeatleast8-16bits–  capture50%to80%oftheimmediates

Processorsrelyoncompilerstogeneratecodesusingthose

addressingmode

26

4TypeAndSizeofOperands

• Thetypeoftheoperandisusuallyencodedintheopcode–  e.g.,LDB–loadbyte;LDW–loadword

• Commonoperandtypes:(implytheirsizes)

Character(8bitsor1byte)

Halfword(16bitsor2bytes)

Word(32bitsor4bytes)

Doubleword(64bitsor8bytes)

SingleprecisionfloaFngpoint(4bytesor1word)

DoubleprecisionfloaFngpoint(8bytesor2words)

ü CharactersarealmostalwaysinASCII

ü 16-bitUnicode(usedinJava)isgainingpopularityü  Integersaretwo�scomplementbinary

ü  Floa,ngpointsfollowtheIEEEstandard754• Somearchitecturessupportpackeddecimal:4bitsareusedto

encodethevalues0-9;2decimaldigitsarepackedintoeachbyte27

Howisthetypeofanoperanddesignated?

Distribu@onofdataaccessesbysize

28

Summary:TypeandSizeofoperands

•  32-architecturesupports8-,16-,and32-bitintegers,32-bitand64-bitIEEE754floaFng-pointdata.

•  Anew64-bitaddressarchitecturesupports64-bitintegers•  MediaprocessorandDSPsneedwideraccumulaFng

registersforaccuracy.

29

5Opera@onsintheInstruc@onSet

•  AllcomputersgenerallyprovideafullsetofoperaFonsfor

thefirstthreecategories

•  AllcomputersmusthavesomeinstrucFonsupportforbasic

systemfuncFons

•  GraphicsinstrucFonstypicallyoperateonmanysmaller

dataitemsinparallel

30

Top10instruc@onsforthe80x86

31

6Instruc@onsforControlFlow

•  ControlinstrucFonschangetheflowofcontrol:–  insteadofexecuFngthenextinstrucFon,theprogram

branchestotheaddressspecifiedinthebranchinginstrucFons

•  Theybreakthepipeline–  DifficulttoopFmizeout

–  ANDtheyarefrequent•  FourtypesofcontrolinstrucFons

–  Condi@onalbranches–  Jumps–uncondi@onaltransfer–  Procedurecalls–  Procedurereturns

32

Breakdownofcontrolflowinstruc@ons

–  Condi@onalbranches–  Jumps–uncondi@onaltransfer–  Procedurecalls–  Procedurereturns

•  Issues:–  Whereisthetargetaddress?Howtospecifyit?–  Whereisreturnaddresskept?Howaretheargumentspassed?

(calls)–  Whereisreturnaddress?Howaretheresultspassed?(returns)

33

AddressingModesforControlFlowInstruc@ons

•  PC-relaFve(ProgramCounter)

–  supplyadisplacementaddedtothePC

•  KnownatcompileFmeforjumps,branches,andcalls(specified

withintheinstrucFon)

–  thetargetisoxennearthecurrentinstrucFon•  requiringfewerbits•  independentlyofwhereitisloaded(posiFonindependence)

•  Registerindirectaddressing–dynamicaddressing

–  ThetargetaddressmaynotbeknownatcompileFme

–  Namingaregisterthatcontainsthetargetaddress

•  Caseorswitchstatements

•  VirtualfuncFonsormethodsinC++orJava

•  High-orderfuncFonsorfuncFonpointersinCorC++•  Dynamicallysharedlibraries

34

Branchdistances

35

Condi@onalBranchOp@ons

36

Figure2.21MajormethodsforevaluaFngbranchcondiFons

ComparisonTypevs.Frequency

•  Mostloopsgofrom0ton.

•  Mostbackwardbranches

areloops–takenabout

90%

37

Program % backward branches

% all control instructions that modify PC

gcc 26% 63% spice 31% 63% TeX 17% 70% Average 25% 65%

ProcedureInvoca@onOp@ons

•  Procedurecallsandreturns–  controltransfer–  statesaving;thereturnaddressmustbesaved

Newerarchitecturesrequirethecompilertogeneratestoresandloads

foreachregistersavedandrestored

•  TwobasicconvenFonsinusetosaveregisters–  callersaving:thecallingproceduremustsavetheregistersthatit

wantspreservedforaccessaxerthecall

•  thecalledprocedureneednotworryaboutregisters–  calleesaving:thecalledproceduremustsavetheregistersitwantsto

use

•  leavingthecallerunrestrained

mostrealsystemstodayuseacombinaFonofboth

•  ApplicaFonbinaryinterface(ABI)thatsetdownthebasicrulesastowhichregisterbecallersavedandwhichshouldbecalleesaved

38

7.EncodinganInstruc@onSet

•  Opcode:specifyingtheoperaFon•  #ofoperand

–  addressingmode

–  addressspecifier:tellswhataddressingmodeisused

–  Load-storecomputer

•  Onlyonememoryoperand

•  Onlyoneortwoaddressingmodes

•  ThearchitecturemustbalancingseveralcompeFngforceswhen

encodingtheinstrucFonset:

–  #ofregisters&&Addressingmodes

–  Sizeofregisters&&Addressingmodefields;AverageinstrucFonsize&&Averageprogramsize.

–  EasytohandleinpipelineimplementaFon.

39

x86vs.AlphaInstruc@onFormats

•  x86:

•  Alpha:

40

Instruc@onLengthTradeoffs

•  Fixedlength:LengthofallinstrucFonsthesame

+EasiertodecodesingleinstrucFoninhardware

+EasiertodecodemulFpleinstrucFonsconcurrently

--WastedbitsininstrucFons(Whyisthisbad?)

--Harder-to-extendISA(howtoaddnewinstrucFons?)

•  Variablelength:LengthofinstrucFonsdifferent(determinedbyopcodeandsub-opcode)

+Compactencoding(Whyisthisgood?)

Intel432:Huffmanencoding(sortof).6to321bitinstrucFons.How?

--MorelogictodecodeasingleinstrucFon

--HardertodecodemulFpleinstrucFonsconcurrently

•  Tradeoffs–  Codesize(memoryspace,bandwidth,latency)vs.hardwarecomplexity

–  ISAextensibilityandexpressiveness

–  Performance?Smallercodevs.imperfectdecode

41

UniformvsNon-uniformDecode

•  Uniformdecode:SamebitsineachinstrucFoncorrespond

tothesamemeaning

–  OpcodeisalwaysinthesamelocaFon

–  immediatevalues,…

–  Many�RISC�ISAs:Alpha,MIPS,SPARC

+Easierdecode,simplerhardware

+Enablesparallelism:generatetargetaddressbeforeknowingtheinstrucFon

isabranch

--RestrictsinstrucFonformat(fewerinstrucFons?)orwastesspace

•  Non-uniformdecode

–  E.g.,opcodecanbethe1st-7thbyteinx86+MorecompactandpowerfulinstrucFonformat

--Morecomplexdecodelogic

42

Threebasicvaria@oninstruc@onencoding:variablelength,fixedlength,andhybrid

43

Thelengthof80x86(CISC)instruc@onsvariesbetween1and17bytes.

ThelengthofmostRISCISAinstruc@onsare4bytes.

&&

X86programaregenerallysmallerthanRISCISA.

ToreduceRISCcodesize

ReducedCodeSizeinRISCs

•  Hybridencoding–support16-bitand32-bitinstrucFonsinRISC,eg.ARMThumb,MIPS16andRISC-V

–  narrowinstrucFonssupportfeweroperaFons,smalleraddressand

immediatefields,fewerregisters,andtwo-addressformatratherthanthe

classicthree-addressformat

–  claimacodesizereducFonofupto40%

•  CompressioninIBM’sCodePack–  AddshardwaretodecompressinstrucFonsastheyarefetchedfrom

memoryonaninstrucFoncachemiss

–  TheinstrucFoncachecontainsfull32-bitinstrucFons,butcompressed

codeiskeptinmainmemory,ROMs,andthedisk

–  ClaimcodereducFon35%-40%

–  PowerPCcreateaHashtableinmemorythatmapbetweencompressed

anduncompressedaddress.Codesize35%~40%

•  Hitachi’sSuperH:fixed16-bitformat

–  16ratherthan32registers–  fewerinstrucFons

44

SummaryofInstruc@onEncoding

•  Threechoices–  Variable,fixedandhybrid–  Notethedifferencesofhybridandvariable

•  ChoicesofinstrucFonencodingisatradeoffbetween–  Forperformance:fixedencoding

–  Forcodesize:variableencoding

•  HowhybridencodingisusedinRISCtoreducecodesize–  16bitand32bit

•  Ingeneral,wesee:–  RISC:fixedorhybrid–  CISC:variable

45

8TheRoleofCompilers

•  Almostallprogrammingisdoneinhigh-levellanguages.

–  AnISAisessenFallyacompliertarget.

•  Seebackupslidesforthecompila,onstagebymost

compiler,e.g.gcc

•  Compilergoals:

–  Allcorrectprogramsexecutecorrectly

–  Mostcompiledprogramsexecutefast(opFmizaFons)

–  FastcompilaFon

–  Debuggingsupport

46

TypicalModernCompilerStructure

47

Figure A.19 Compilers typically consist of two to four passes, with more highly optimizing compilers having more passes. This structure maximizes the probability that a program compiled at various levels of optimization will produce the same output when given the same input. The optimizing passes are designed to be optional and may be skipped when faster compilation is the goal and lower-quality code is acceptable. A pass is simply one phase in which the compiler reads and transforms the entire program. (The term phase is often used inter-changeably with pass.) Because the optimizing passes are separated, multiple languages can use the same optimizing and code generation passes. Only a new front end is required for a new language.

Op@miza@onTypes

•  Highlevel–doneatornearsourcecodelevel–  Ifprocedureiscalledonlyonce,putitin-lineandsaveCALL–  moregeneralcase:ifcall-count<somethreshold,putthemin-line

•  Local–donewithinstraight-linecode–  commonsub-expressionsproducesamevalue–eitherallocatea

registerorreplacewithsinglecopy

–  constantpropagaFon–replaceconstantvaluedvariablewiththeconstant

–  stackheightreducFon–re-arrangeexpressiontreetominimize

temporarystorageneeds

•  Global–acrossabranch–  copypropagaFon–replaceallinstancesofavariableAthathas

beenassignedX(i.e.,A=X)withX.

–  codemoFon–removecodefromaloopthatcomputessamevalue

eachiteraFonoftheloopandputitbeforetheloop

–  simplifyoreliminatearrayaddressingcalculaFonsinloops

48

Op@miza@onTypes

•  Machine-dependentopFmizaFons–basedonmachine

knowledge

–  strengthreducFon–replacemulFplybyaconstantwithshixs

andadds

•  wouldmakesenseiftherewasnohardwaresupportfor

MUL

•  atrickierversion:17×=arithmeFclexshix4andadd

•  pipeliningscheduling–reorderinstrucFonstoimprove

pipelineperformance

–  dependencyanalysis–  branchoffsetopFmizaFon-reordercodetominimizebranch

offsets

49

Majortypesofop@miza@ons

50

ComplierOp@miza@ons–ChangeinIC

51

•  L0–unopFmized

•  L1–localopts,codescheduling,&localreg.allocaFon•  L2–globaloptsandlooptransformaFons,&globalreg.AllocaFon

•  L3–procedureintegraFon

gcc-O2hello.c-ohello

CompilerBasedRegisterOp@miza@on

•  Compilerassumessmallnumberofregisters(16-32)–  OpFmizinguseisuptocompiler

–  HLLprogramshavenoexplicitreferencestoregisters

–  usually–isthisalwaystrue?

•  CompilerApproach

–  Assignsymbolicorvirtualregistertoeachcandidatevariable

–  Map(unlimited)symbolicregisterstorealregisters

–  Symbolicregistersthatdonotoverlapcansharerealregisters

–  Ifyourunoutofrealregisterssomevariables

•  Spilling

GraphColoring

•  Givenagraphofnodesandedges–  Assignacolortoeachnode–  Adjacentnodeshavedifferentcolors–  Useminimumnumberofcolors

•  RegistraFonallocaFon–  Nodesaresymbolicregisters

–  Tworegistersthatareliveinthesameprogramfragmentare

joinedbyanedge

–  Trytocolorthegraphwithncolors,wherenisthenumberof

realregisters

–  Nodesthatcannotbecoloredareplacedinmemory

h]ps://en.wikipedia.org/wiki/Graph_coloring

Iron-codeSummary

•  Sec-onA.2—Usegeneral-purposeregisterswithaload-storearchitecture.•  Sec-onA.3—Supporttheseaddressingmodes:displacement(withanaddressoffset

sizeof12to16bits),immediate(size8to16bits),andregisterindirect.•  Sec-onA.4—Supportthesedatasizesandtypes:8-,16-,32-,and64-bitintegersand

64-bitIEEE754floa@ng-pointnumbers.–  Nowwesee16-bitFPfordeeplearninginGPU

•  hup://www.nextplavorm.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/

•  Sec-onA.5—Supportthesesimpleinstruc@ons,sincetheywilldominatethenumberofinstruc@onsexecuted:load,store,add,subtract,moveregister-register,andshiR.

•  Sec-onA.6—Compareequal,comparenotequal,compareless,branch(withaPC-rela@veaddressatleast8bitslong),jump,call,andreturn.

•  Sec-onA.7—Usefixedinstruc@onencodingifinterestedinperformance,andusevariableinstruc@onencodingifinterestedincodesize.

•  Sec-onA.8—Provideatleast16general-purposeregisters,besurealladdressingmodesapplytoalldatatransferinstruc@ons,andaimforaminimalistIS

–  ORenuseseparatefloa@ng-pointregisters.–  Thejus@fica@onistoincreasethetotalnumberofregisterswithoutraisingproblems

intheinstruc@onfor-matorinthespeedofthegeneral-purposeregisterfile.Thiscompromise,however,isnotorthogonal.

54

RealWorldISA

55

Thedetailsistotrade-off

56

lecture04 ISA Principles - GitHub Pages · 1 2 Register-Memory IBM 360/370, Intel 80x86, Motorola...

Documents

Transcript of lecture04 ISA Principles - GitHub Pages · 1 2 Register-Memory IBM 360/370, Intel 80x86, Motorola...