lecture04 ISA Principles - GitHub Pages · 1 2 Register-Memory IBM 360/370, Intel 80x86, Motorola...
Transcript of lecture04 ISA Principles - GitHub Pages · 1 2 Register-Memory IBM 360/370, Intel 80x86, Motorola...
Lecture04:ISAPrinciples
CSE564ComputerArchitectureSummer2017
DepartmentofComputerScienceandEngineering
YonghongYan
www.secs.oakland.edu/~yan
1
Contents
1. Introduc@on2. ClassifyingInstruc@onSetArchitectures3. MemoryAddressing4. TypeandSizeofOperands5. Opera@onsintheInstruc@onSet6. Instruc@onsforControlFlow7. EncodinganInstruc@onSet8. CrosscuMngIssues:TheRoleofCompilers9. RISC-VISA• Supplement
– MIPSISA
– RISCvsCISC– CompilercompilaFonstages
– ISAHistorical• AppendixL
– ComparisonofISA
• AppendixK2
1Introduc@on
Instruc@onSetArchitecture–theporFonofthemachine
visibletotheassemblylevelprogrammerortothe
compilerwriter
– Tousethehardwareofacomputer,wemustspeakits
language
– Thewordsofacomputerlanguagearecalledinstruc,ons,and
itsvocabularyiscalledaninstruc,onset
3
instruc@onset
soRware
hardware
Instr.# OperaFon+Operands
i movl-4(%ebp),%eax
(i+1)addl%eax,(%edx)
(i+2)cmpl8(%ebp),%eax
(i+3)jl L5
:
L5:
4
sum.sforX86
• h]p://www.hep.wisc.edu/~pinghc/
x86AssmTutorial.htm
• h]ps://en.wikibooks.org/wiki/X86_Assembly/SSE
5
sum.sforRISC-V
h]ps://riscv.org/
2ClassifyingInstruc@onSetArchitectures
6
OperandstorageinCPU Wherearetheyotherthanmemory
#explicitoperandsnamedperinstruc@on
Howmany?Min,Max,Average
Addressingmode HowtheeffecFveaddressforan
operandcalculated?Canalluseany
mode?
Opera@ons WhataretheopFonsfortheopcode?
Type&sizeofoperands Howistypingdone?Howisthesize
specified?
Thesechoicescri,callyaffectnumberofinstruc,ons,CPI,and
CPUcycle,me
ISAClassifica@on
• MostbasicdifferenFaFon:internalstorageinaprocessor
– Operandsmaybenamedexplicitlyorimplicitly
• Majorchoices:
1. Inanaccumulatorarchitectureoneoperandisimplicitlythe
accumulator=>similartocalculator
2. Theoperandsinastackarchitectureareimplicitlyonthe
topofthestack
3. Thegeneral-purposeregisterarchitectureshaveonlyexplicitoperands–eitherregistersormemorylocaFon
7
RegisterMachines
• Howmanyregistersaresufficient?
• General-purposeregistersvs.special-purposeregisters• compilerflexibilityandhand-op,miza,on
• TwomajorconcernsforarithmeFcandlogicalinstrucFons(ALU)
1.Twoorthreeoperands
• X+Y⇒X• X+Y⇒Z
2.Howmanyoftheoperandsmaybememoryaddresses(0–3)
8Hence,registerarchitectureclassifica@on(#mem,#operands)
Number of memory addresses
Maximum number of operands
allowed
Type of Architecture
Examples
0 3 Load-Store Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, TM32
1 2 Register-Memory IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x
2 2 Memory – memory VAX (also has 3 operand formats)
3 3 Memory - memory VAX (also has 2 operand formats)
(0,3):Register-Register
• ALUisRegistertoRegister–alsoknownaspureReducedInstruc-onSetComputer(RISC)
o Advantages– simplefixedlengthinstrucFonencoding
– decodeissimplesinceinstrucFontypesaresmall
– simplecodegeneraFonmodel
– instrucFonCPItendstobeveryuniform
– exceptformemoryinstrucFonsofcourse
– butthereareonly2ofthem-loadandstore
o Disadvantages– instrucFoncounttendstobehigher
– someinstrucFonsareshort-wasFnginstrucFonwordbits
9
(1,2):Register-Memory
10
EvolvedRISCandalsooldCISC
•newRISCmachinescapableofdoingspecula,veloads
•predicatedand/ordeferredloadsarealsopossible
o Advantages– dataaccesstoALUimmediatewithoutloadingfirst
– instrucFonformatisrelaFvelysimpletoencode
– codedensityisimprovedoverRegister(0,3)model
o Disadvantages– operandsarenotequivalent-sourceoperandmaybe
destroyed
– needformemoryaddressfieldmaylimit#ofregisters
– CPIwillvary•ifmemorytargetisinL0cachethennotsobad
•ifnot-lifegetsmiserable
(2,2)or(3,3):Memory-Memory
11
TrueandmostcomplexCISCmodel
•currentlyexFnctandlikelytoremainso
•morecomplexmemoryacFonsarelikelytoappearbutnot
directlylinkedtotheALU
o Advantages– mostcompactcode
– doesn’twasteregistersfortemporaryvalues
• goodideaforuseoncedata-e.g.streamingmedia
o Disadvantages– largevariaFonininstrucFonsize-mayneedashoe-horn
– largevariaFoninCPI-i.e.workperinstrucFon
– exacerbatestheinfamousmemorybo]leneck
• registerfilereducesmemoryaccessesifreused
Notusedtoday
Summary:TradeoffsfortheISAClasses
12
Type Advantages Disadvantages
Register-register (0,3)
Simple, fixed length instruction encoding. Simple code generation model. Instructions take similar numbers of clocks to execute.
Higher instruction count than architectures with memory references in the instructions. More instructions and lower instruction density leads to larger programs
Register-memory (1,2)
Data can be accessed without a separate load instruction first. Instruction format tends to be easy to encode and yields good density
Operands are not equivalent since a source operand is destroyed. Encoding a register number and a memory address in each instruction may restrict the number of registers. Clocks per instruction vary by operand location
Memory-memory (2,2) or (3,3)
Most compact. Does not waste registers for temporaries.
Large variation in instruction size, especially for three-operand instructions. In addition, large variation in work per instruction. Memory accesses create memory bottleneck. (Not used today)
3MemoryAddressing
• Objectshavebyteaddresses– thenumberofbytescountedfromthebeginningofmemory
• ObjectLength:– bytes(8bits),halfwords(16bits),– words(32bits),anddoublewords(64bits).– Thetypeisimpliedinopcode,e.g.,
• LDB–loadbyte• LDW–loadword,etc
• ByteOrdering– Li]leEndian:putsthebytewhoseaddressisxx00attheleastsignificantposiFonintheword.(7,6,5,4,3,2,1,0)
– BigEndian:putsthebytewhoseaddressisxx00atthemostsignificantposiFonin
theword.(0,1,2,3,4,5,6,7)
• Problemoccurswhenexchangingdataamongmachineswithdifferent
orderings
13
Interpre@ngMemoryAddresses
• AlignmentIssues
– Accessestoobjectslargerthanabytemustbealigned.Anaccessto
anobjectofsizesbytesatbyteaddressAisalignedifAmods=0.
– MisalignmentcauseshardwarecomplicaFons,sincethememoryis
typicallyalignedonawordoradouble-wordboundary
– Misalignmenttypicallyresultsinanalignmentfaultthatmustbe
handledbytheOS
• Hence– byteaddressisanything-nevermisaligned
– halfword-evenaddresses-loworderaddressbit=0(XXXXXXX0)elsetrap
– word-loworder2addressbits=0(XXXXXX00)elsetrap– doubleword-loworder3addressbits=0(XXXXX000)elsetrap
14
MemoryAlignment
15
16
Aligned/MisalignedAddresses
17
AddressingModes
• Howarchitecturespecifytheeffec@veaddressofanobject?
– Effec-veaddress:theactualmemoryaddressspecifiedbythe
addressingmode.
• E.g.Mem[R[R1]]referstothecontentsofthememory
loca,onwhoseloca,onisgiventhecontentsofregister1
(R1).
• AddressingModes:– Register.– Immediate
– Displacement
– Registerindirect,……..
18
AddressModes
AddressingModeImpacts
• InstrucFoncounts• ArchitectureComplexity
• CPI
19
20
Summaryofuseofmemoryaddressingmodes
21
Displacementvaluesarewidelydistributed
ImpactinstrucFonlength
DisplacementAddressingMode
• Benchmarksshow– 12bitsofdisplacementwouldcaptureabout75%ofthefull32-bit
displacements– 16bitsshouldcaptureabout99%
• Remember:– [email protected],thechoiceisatleast12-16bits
• Foraddressesthatdofitindisplacementsize:
Add R4,10000(R0)
• Foraddressesthatdon’tfitindisplacementsize,thecompilermustdo
thefollowing:
Load R1,1000000
AddR1,R0
Add R4,0(R1)
22
ImmediateAddressingMode
• Usedwherewewanttogettoanumericalvalueinan
instrucFon
• Around25%oftheopera@onshaveanimmediateoperand
23
Athighlevel:a=b+3;if(a>17)gotoAddr
AtAssemblerlevel:LoadR2,#3AddR0,R1,R2LoadR2,#17CMPBGTR1,R2LoadR1,AddressJump(R1)
About25%ofdatatransferandALUopera@onshaveanimmediateoperand
24
ImpactinstrucFonlength
Numberofbitsforimmediate
• 16bitswouldcaptureabout80%and8bitsabout50%.
25
ImpactinstrucFonlength
Summary:MemoryAddressing
• Anewarchitectureexpectedtosupportatleast:displacement,immediate,andregisterindirect– represent75%to99%oftheaddressingmodes
• Thesizeoftheaddressfordisplacementmodetobeat
least12-16bits– capture75%to99%ofthedisplacements
• Thesizeoftheimmediatefieldtobeatleast8-16bits– capture50%to80%oftheimmediates
Processorsrelyoncompilerstogeneratecodesusingthose
addressingmode
26
4TypeAndSizeofOperands
• Thetypeoftheoperandisusuallyencodedintheopcode– e.g.,LDB–loadbyte;LDW–loadword
• Commonoperandtypes:(implytheirsizes)
Character(8bitsor1byte)
Halfword(16bitsor2bytes)
Word(32bitsor4bytes)
Doubleword(64bitsor8bytes)
SingleprecisionfloaFngpoint(4bytesor1word)
DoubleprecisionfloaFngpoint(8bytesor2words)
ü CharactersarealmostalwaysinASCII
ü 16-bitUnicode(usedinJava)isgainingpopularityü Integersaretwo�scomplementbinary
ü Floa,ngpointsfollowtheIEEEstandard754• Somearchitecturessupportpackeddecimal:4bitsareusedto
encodethevalues0-9;2decimaldigitsarepackedintoeachbyte27
Howisthetypeofanoperanddesignated?
Distribu@onofdataaccessesbysize
28
Summary:TypeandSizeofoperands
• 32-architecturesupports8-,16-,and32-bitintegers,32-bitand64-bitIEEE754floaFng-pointdata.
• Anew64-bitaddressarchitecturesupports64-bitintegers• MediaprocessorandDSPsneedwideraccumulaFng
registersforaccuracy.
29
5Opera@onsintheInstruc@onSet
• AllcomputersgenerallyprovideafullsetofoperaFonsfor
thefirstthreecategories
• AllcomputersmusthavesomeinstrucFonsupportforbasic
systemfuncFons
• GraphicsinstrucFonstypicallyoperateonmanysmaller
dataitemsinparallel
30
Top10instruc@onsforthe80x86
31
6Instruc@onsforControlFlow
• ControlinstrucFonschangetheflowofcontrol:– insteadofexecuFngthenextinstrucFon,theprogram
branchestotheaddressspecifiedinthebranchinginstrucFons
• Theybreakthepipeline– DifficulttoopFmizeout
– ANDtheyarefrequent• FourtypesofcontrolinstrucFons
– Condi@onalbranches– Jumps–uncondi@onaltransfer– Procedurecalls– Procedurereturns
32
Breakdownofcontrolflowinstruc@ons
– Condi@onalbranches– Jumps–uncondi@onaltransfer– Procedurecalls– Procedurereturns
• Issues:– Whereisthetargetaddress?Howtospecifyit?– Whereisreturnaddresskept?Howaretheargumentspassed?
(calls)– Whereisreturnaddress?Howaretheresultspassed?(returns)
33
AddressingModesforControlFlowInstruc@ons
• PC-relaFve(ProgramCounter)
– supplyadisplacementaddedtothePC
• KnownatcompileFmeforjumps,branches,andcalls(specified
withintheinstrucFon)
– thetargetisoxennearthecurrentinstrucFon• requiringfewerbits• independentlyofwhereitisloaded(posiFonindependence)
• Registerindirectaddressing–dynamicaddressing
– ThetargetaddressmaynotbeknownatcompileFme
– Namingaregisterthatcontainsthetargetaddress
• Caseorswitchstatements
• VirtualfuncFonsormethodsinC++orJava
• High-orderfuncFonsorfuncFonpointersinCorC++• Dynamicallysharedlibraries
34
Branchdistances
35
Condi@onalBranchOp@ons
36
Figure2.21MajormethodsforevaluaFngbranchcondiFons
ComparisonTypevs.Frequency
• Mostloopsgofrom0ton.
• Mostbackwardbranches
areloops–takenabout
90%
37
Program % backward branches
% all control instructions that modify PC
gcc 26% 63% spice 31% 63% TeX 17% 70% Average 25% 65%
ProcedureInvoca@onOp@ons
• Procedurecallsandreturns– controltransfer– statesaving;thereturnaddressmustbesaved
Newerarchitecturesrequirethecompilertogeneratestoresandloads
foreachregistersavedandrestored
• TwobasicconvenFonsinusetosaveregisters– callersaving:thecallingproceduremustsavetheregistersthatit
wantspreservedforaccessaxerthecall
• thecalledprocedureneednotworryaboutregisters– calleesaving:thecalledproceduremustsavetheregistersitwantsto
use
• leavingthecallerunrestrained
mostrealsystemstodayuseacombinaFonofboth
• ApplicaFonbinaryinterface(ABI)thatsetdownthebasicrulesastowhichregisterbecallersavedandwhichshouldbecalleesaved
38
7.EncodinganInstruc@onSet
• Opcode:specifyingtheoperaFon• #ofoperand
– addressingmode
– addressspecifier:tellswhataddressingmodeisused
– Load-storecomputer
• Onlyonememoryoperand
• Onlyoneortwoaddressingmodes
• ThearchitecturemustbalancingseveralcompeFngforceswhen
encodingtheinstrucFonset:
– #ofregisters&&Addressingmodes
– Sizeofregisters&&Addressingmodefields;AverageinstrucFonsize&&Averageprogramsize.
– EasytohandleinpipelineimplementaFon.
39
x86vs.AlphaInstruc@onFormats
• x86:
• Alpha:
40
Instruc@onLengthTradeoffs
• Fixedlength:LengthofallinstrucFonsthesame
+EasiertodecodesingleinstrucFoninhardware
+EasiertodecodemulFpleinstrucFonsconcurrently
--WastedbitsininstrucFons(Whyisthisbad?)
--Harder-to-extendISA(howtoaddnewinstrucFons?)
• Variablelength:LengthofinstrucFonsdifferent(determinedbyopcodeandsub-opcode)
+Compactencoding(Whyisthisgood?)
Intel432:Huffmanencoding(sortof).6to321bitinstrucFons.How?
--MorelogictodecodeasingleinstrucFon
--HardertodecodemulFpleinstrucFonsconcurrently
• Tradeoffs– Codesize(memoryspace,bandwidth,latency)vs.hardwarecomplexity
– ISAextensibilityandexpressiveness
– Performance?Smallercodevs.imperfectdecode
41
UniformvsNon-uniformDecode
• Uniformdecode:SamebitsineachinstrucFoncorrespond
tothesamemeaning
– OpcodeisalwaysinthesamelocaFon
– immediatevalues,…
– Many�RISC�ISAs:Alpha,MIPS,SPARC
+Easierdecode,simplerhardware
+Enablesparallelism:generatetargetaddressbeforeknowingtheinstrucFon
isabranch
--RestrictsinstrucFonformat(fewerinstrucFons?)orwastesspace
• Non-uniformdecode
– E.g.,opcodecanbethe1st-7thbyteinx86+MorecompactandpowerfulinstrucFonformat
--Morecomplexdecodelogic
42
Threebasicvaria@oninstruc@onencoding:variablelength,fixedlength,andhybrid
43
Thelengthof80x86(CISC)instruc@onsvariesbetween1and17bytes.
ThelengthofmostRISCISAinstruc@onsare4bytes.
&&
X86programaregenerallysmallerthanRISCISA.
ToreduceRISCcodesize
ReducedCodeSizeinRISCs
• Hybridencoding–support16-bitand32-bitinstrucFonsinRISC,eg.ARMThumb,MIPS16andRISC-V
– narrowinstrucFonssupportfeweroperaFons,smalleraddressand
immediatefields,fewerregisters,andtwo-addressformatratherthanthe
classicthree-addressformat
– claimacodesizereducFonofupto40%
• CompressioninIBM’sCodePack– AddshardwaretodecompressinstrucFonsastheyarefetchedfrom
memoryonaninstrucFoncachemiss
– TheinstrucFoncachecontainsfull32-bitinstrucFons,butcompressed
codeiskeptinmainmemory,ROMs,andthedisk
– ClaimcodereducFon35%-40%
– PowerPCcreateaHashtableinmemorythatmapbetweencompressed
anduncompressedaddress.Codesize35%~40%
• Hitachi’sSuperH:fixed16-bitformat
– 16ratherthan32registers– fewerinstrucFons
44
SummaryofInstruc@onEncoding
• Threechoices– Variable,fixedandhybrid– Notethedifferencesofhybridandvariable
• ChoicesofinstrucFonencodingisatradeoffbetween– Forperformance:fixedencoding
– Forcodesize:variableencoding
• HowhybridencodingisusedinRISCtoreducecodesize– 16bitand32bit
• Ingeneral,wesee:– RISC:fixedorhybrid– CISC:variable
45
8TheRoleofCompilers
• Almostallprogrammingisdoneinhigh-levellanguages.
– AnISAisessenFallyacompliertarget.
• Seebackupslidesforthecompila,onstagebymost
compiler,e.g.gcc
• Compilergoals:
– Allcorrectprogramsexecutecorrectly
– Mostcompiledprogramsexecutefast(opFmizaFons)
– FastcompilaFon
– Debuggingsupport
46
TypicalModernCompilerStructure
47
Figure A.19 Compilers typically consist of two to four passes, with more highly optimizing compilers having more passes. This structure maximizes the probability that a program compiled at various levels of optimization will produce the same output when given the same input. The optimizing passes are designed to be optional and may be skipped when faster compilation is the goal and lower-quality code is acceptable. A pass is simply one phase in which the compiler reads and transforms the entire program. (The term phase is often used inter-changeably with pass.) Because the optimizing passes are separated, multiple languages can use the same optimizing and code generation passes. Only a new front end is required for a new language.
Op@miza@onTypes
• Highlevel–doneatornearsourcecodelevel– Ifprocedureiscalledonlyonce,putitin-lineandsaveCALL– moregeneralcase:ifcall-count<somethreshold,putthemin-line
• Local–donewithinstraight-linecode– commonsub-expressionsproducesamevalue–eitherallocatea
registerorreplacewithsinglecopy
– constantpropagaFon–replaceconstantvaluedvariablewiththeconstant
– stackheightreducFon–re-arrangeexpressiontreetominimize
temporarystorageneeds
• Global–acrossabranch– copypropagaFon–replaceallinstancesofavariableAthathas
beenassignedX(i.e.,A=X)withX.
– codemoFon–removecodefromaloopthatcomputessamevalue
eachiteraFonoftheloopandputitbeforetheloop
– simplifyoreliminatearrayaddressingcalculaFonsinloops
48
Op@miza@onTypes
• Machine-dependentopFmizaFons–basedonmachine
knowledge
– strengthreducFon–replacemulFplybyaconstantwithshixs
andadds
• wouldmakesenseiftherewasnohardwaresupportfor
MUL
• atrickierversion:17×=arithmeFclexshix4andadd
• pipeliningscheduling–reorderinstrucFonstoimprove
pipelineperformance
– dependencyanalysis– branchoffsetopFmizaFon-reordercodetominimizebranch
offsets
49
Majortypesofop@miza@ons
50
ComplierOp@miza@ons–ChangeinIC
51
• L0–unopFmized
• L1–localopts,codescheduling,&localreg.allocaFon• L2–globaloptsandlooptransformaFons,&globalreg.AllocaFon
• L3–procedureintegraFon
gcc-O2hello.c-ohello
CompilerBasedRegisterOp@miza@on
• Compilerassumessmallnumberofregisters(16-32)– OpFmizinguseisuptocompiler
– HLLprogramshavenoexplicitreferencestoregisters
– usually–isthisalwaystrue?
• CompilerApproach
– Assignsymbolicorvirtualregistertoeachcandidatevariable
– Map(unlimited)symbolicregisterstorealregisters
– Symbolicregistersthatdonotoverlapcansharerealregisters
– Ifyourunoutofrealregisterssomevariables
• Spilling
GraphColoring
• Givenagraphofnodesandedges– Assignacolortoeachnode– Adjacentnodeshavedifferentcolors– Useminimumnumberofcolors
• RegistraFonallocaFon– Nodesaresymbolicregisters
– Tworegistersthatareliveinthesameprogramfragmentare
joinedbyanedge
– Trytocolorthegraphwithncolors,wherenisthenumberof
realregisters
– Nodesthatcannotbecoloredareplacedinmemory
h]ps://en.wikipedia.org/wiki/Graph_coloring
Iron-codeSummary
• Sec-onA.2—Usegeneral-purposeregisterswithaload-storearchitecture.• Sec-onA.3—Supporttheseaddressingmodes:displacement(withanaddressoffset
sizeof12to16bits),immediate(size8to16bits),andregisterindirect.• Sec-onA.4—Supportthesedatasizesandtypes:8-,16-,32-,and64-bitintegersand
64-bitIEEE754floa@ng-pointnumbers.– Nowwesee16-bitFPfordeeplearninginGPU
• hup://www.nextplavorm.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/
• Sec-onA.5—Supportthesesimpleinstruc@ons,sincetheywilldominatethenumberofinstruc@onsexecuted:load,store,add,subtract,moveregister-register,andshiR.
• Sec-onA.6—Compareequal,comparenotequal,compareless,branch(withaPC-rela@veaddressatleast8bitslong),jump,call,andreturn.
• Sec-onA.7—Usefixedinstruc@onencodingifinterestedinperformance,andusevariableinstruc@onencodingifinterestedincodesize.
• Sec-onA.8—Provideatleast16general-purposeregisters,besurealladdressingmodesapplytoalldatatransferinstruc@ons,andaimforaminimalistIS
– ORenuseseparatefloa@ng-pointregisters.– Thejus@fica@onistoincreasethetotalnumberofregisterswithoutraisingproblems
intheinstruc@onfor-matorinthespeedofthegeneral-purposeregisterfile.Thiscompromise,however,isnotorthogonal.
54
RealWorldISA
55
Thedetailsistotrade-off
56