CSC 631: High-Performance Computer Architecture
Spring 2017, Lecture 6: Out-of-Order Processors
Supercomputers
Definitions of a supercomputer:
§ Fastest machine in world at given task
§ A device to turn a compute-bound problem into an I/O-bound problem
§ Any machine costing $30M+
§ Any machine designed by Seymour Cray
§ CDC 6600 (Cray, 1964) regarded as first supercomputer
CDC 6600 (Seymour Cray, 1963)
§ A fast pipelined machine with 60-bit words
  - 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
  - Floating Point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for Input/Output
  - a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
  - over 100 sold ($7-10M each)
CDC 6600: A Load/Store Architecture
• Separate instructions to manipulate three types of reg.
  • 8 x 60-bit data registers (X)
  • 8 x 18-bit address registers (A)
  • 8 x 18-bit index registers (B)
• All arithmetic and logic instructions are register-to-register
• Only Load and Store instructions refer to memory!
  Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
  - very useful for vector operations

Instruction formats (field widths in bits):
  opcode(6)  i(3)  j(3)  k(3)        Ri ← Rj op Rk
  opcode(6)  i(3)  j(3)  disp(18)    Ri ← M[Rj + disp]
CDC 6600: Datapath
[Figure: datapath connecting Address Regs (8 x 18-bit), Index Regs (8 x 18-bit), Operand Regs (8 x 60-bit), Inst. Stack (8 x 60-bit), IR, 10 Functional Units, and Central Memory (128K words, 32 banks, 1 µs cycle); paths labelled operand, result, operand addr, result addr.]
CDC 6600 ISA designed to simplify high-performance implementation
§ Use of three-address, register-register ALU instructions simplifies pipelined implementation
  - Only 3-bit register specifier fields checked for dependencies
  - No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
  - Software can schedule load of address register before use of value
  - Can interleave independent instructions in between
§ CDC 6600 has multiple parallel but unpipelined functional units
  - E.g., 2 separate multipliers
§ Follow-on machine CDC 7600 used pipelined functional units
  - Foreshadows later RISC designs
CDC 6600: Vector Addition
        B0 ← -n
  loop: JZE B0, exit
        A0 ← B0 + a0      load X0
        A1 ← B0 + b0      load X1
        X6 ← X0 + X1
        A6 ← B0 + c0      store X6
        B0 ← B0 + 1
        jump loop

Ai = address register
Bi = index register
Xi = data register
CDC 6600 Scoreboard
§ Instructions dispatched in-order to functional units provided no structural hazard or WAW
  - Stall on structural hazard, no functional units available
  - Only one pending write to any register
§ Instructions wait for input operands (RAW hazards) before execution
  - Can execute out-of-order
§ Instructions wait for output register to be read by preceding instructions (WAR)
  - Result held in functional unit until register free
IBM 360/91 Floating-Point Unit (R. M. Tomasulo, 1967)
[Figure: instructions enter from the top; load buffers (from memory), the floating-point regfile, and store buffers (to memory) surround reservation stations in front of the Adder and Mult units. Each buffer and station operand holds "p tag/data" fields, and results are broadcast as <tag, result>.]
§ Distribute reservation stations to functional units
§ Common bus ensures that data is made available immediately to all the instructions waiting for it. Match tag, if equal, copy value & set presence “p”.
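The tag-match step on the common bus can be sketched in a few lines. This is an illustrative model only, not the 360/91 logic: the `Operand` class, the `broadcast` function, and the tag name "mult1" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Operand:
    p: bool = False               # presence bit: True once the value is valid
    tag: Optional[str] = None     # producer tag awaited while p is False
    value: Optional[float] = None

def broadcast(stations, tag, result):
    """Drive <tag, result> on the common bus; every waiting operand compares tags."""
    for operands in stations:
        for opnd in operands:
            if not opnd.p and opnd.tag == tag:
                opnd.value, opnd.p = result, True   # copy value, set presence

# Two stations wait on the same (hypothetical) producer tag "mult1".
rs_add = [Operand(p=True, value=1.5), Operand(tag="mult1")]
rs_store = [Operand(tag="mult1")]
broadcast([rs_add, rs_store], "mult1", 6.0)
```

After the single broadcast, both waiting operands hold 6.0, which is the point of the common bus: one result wakes every consumer at once.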
Out-of-Order Fades into Background
Out-of-order processing implemented commercially in 1960s, but disappeared again until 1990s as two major problems had to be solved:
§ Precise traps
  - Imprecise traps complicate debugging and OS code
  - Note, precise interrupts are relatively easy to provide
§ Branch prediction
  - Amount of exploitable instruction-level parallelism (ILP) limited by control hazards
Also, simpler machine designs in new technology beat complicated machines in old technology
  - Big advantage to fit processor & caches on one chip
  - Microprocessors had era of 1%/week performance scaling
Separating Completion from Commit
§ Re-order buffer holds register results from completion until commit
  - Entries allocated in program order during decode
  - Buffers completed values and exception state until in-order commit point
  - Completed values can be used by dependents before committed (bypassing)
  - Each entry holds program counter, instruction type, destination register specifier and value if any, and exception status (info often compressed to save hardware)
§ Memory reordering needs special data structures
  - Speculative store address and data buffers
  - Speculative load address and data buffers
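In-Order Commit for Precise Traps
§ In-order instruction fetch and decode, and dispatch to reservation stations inside reorder buffer
§ Instructions issue from reservation stations out-of-order
§ Out-of-order completion, values stored in temporary buffers
§ Commit is in-order, checks for traps, and if none updates architectural state
[Figure: Fetch → Decode → Reorder Buffer → Commit, with Execute attached to the reorder buffer. Fetch/decode are in-order, execute is out-of-order, commit is in-order. When commit detects "Trap?", Kill signals flush the earlier stages and the handler PC is injected into fetch.]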
Phases of Instruction Execution
[Figure: PC → I-cache → Fetch Buffer → Decode/Rename → Issue Buffer → Functional Units → Result Buffer → Commit → Architectural State]
Fetch: Instruction bits retrieved from instruction cache.
Decode: Instructions dispatched to appropriate issue buffer.
Execute: Instructions and operands issued to functional units. When execution completes, all results and exception flags are available.
Commit: Instruction irrevocably updates architectural state (aka “graduation”), or takes precise trap/interrupt.
In-Order versus Out-of-Order Phases
§ Instruction fetch/decode/rename always in-order
  - Need to parse ISA sequentially to get correct semantics
  - Proposals for speculative OoO instruction fetch, e.g., Multiscalar. Predict control flow and data dependencies across sequential program segments fetched/decoded/executed in parallel, fix up if prediction wrong
§ Dispatch (place instruction into machine buffers to wait for issue) also always in-order
  - Some use “Dispatch” to mean issue, but not in these lectures
In-Order versus Out-of-Order Issue
§ In-order issue:
  - Issue stalls on RAW dependencies or structural hazards, or possibly WAR/WAW hazards
  - Instruction cannot issue to execution units unless all preceding instructions have issued to execution units
§ Out-of-order issue:
  - Instructions dispatched in program order to reservation stations (or other forms of instruction buffer) to wait for operands to arrive, or other hazards to clear
  - While earlier instructions wait in issue buffers, following instructions can be dispatched and issued out-of-order
In-Order versus Out-of-Order Completion
§ All but the simplest machines have out-of-order completion, due to different latencies of functional units and desire to bypass values as soon as available
§ Classic RISC 5-stage integer pipeline just barely has in-order completion
  - Load takes two cycles, but following one-cycle integer op completes at same time, not earlier
  - Adding pipelined FPU immediately brings OoO completion
In-Order versus Out-of-Order Commit
§ In-order commit supports precise traps, standard today
  - Some proposals to reduce the cost of in-order commit by retiring some instructions early to compact reorder buffer, but this is just an optimized in-order commit
§ Out-of-order commit was effectively what early OoO machines implemented (imprecise traps) as completion irrevocably changed machine state
  - i.e., complete == commit in these machines
OoO Design Choices
§ Where are reservation stations?
  - Part of reorder buffer, or in separate issue window?
  - Distributed by functional units, or centralized?
§ How is register renaming performed?
  - Tags and data held in reservation stations, with separate architectural register file
  - Tags only in reservation stations, data held in unified physical register file
“Data-in-ROB” Design (HP PA8000, Pentium Pro, Core2Duo, Nehalem)
ROB entry fields: v | i | Opcode | p Tag Src1 | p Tag Src2 | p Reg Result | Except?
(Oldest and Free pointers delimit the active entries.)
§ Managed as circular buffer in program order, new instructions dispatched to free slots, oldest instruction committed/reclaimed when done (“p” bit set on result)
§ Tag is given by index in ROB (Free pointer value)
§ In dispatch, non-busy source operands read from architectural register file and copied to Src1 and Src2 with presence bit “p” set. Busy operands copy tag of producer and clear “p” bit.
§ Set valid bit “v” on dispatch, set issued bit “i” on issue
§ On completion, search source tags, set “p” bit and copy data into src on tag match. Write result and exception flags to ROB.
§ On commit, check exception status, and copy result into architectural register file if no trap.
§ On trap, flush machine and ROB, set free = oldest, jump to handler
Managing Rename for Data-in-ROB
Rename table associated with architectural registers, managed in decode/dispatch.
One entry per arch. register, with fields: p | Tag | Value
§ If “p” bit set, then use value in architectural register file
§ Else, tag field indicates instruction that will/has produced value
§ For dispatch, read source operands <p, tag, value> from arch. regfile, and also read <p, result> from producing instruction in ROB, bypassing as needed. Copy to ROB
§ Write destination arch. register entry with <0, Free, _>, to assign tag to ROB index of this instruction
§ On commit, update arch. regfile with <1, _, Result>
§ On trap, reset table (All p = 1)
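Data Movement in Data-in-ROB Design
[Figure: data paths among the Architectural Register File, the ROB (Source Operands and Result Data fields), and the Functional Units:]
  - Read operands during decode (from Architectural Register File)
  - Write sources in dispatch
  - Bypass newer values at dispatch
  - Read operands at issue
  - Write results at completion
  - Read results for commit
  - Write results at commit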
Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy/Ivy Bridge)
§ Rename all architectural registers into a single physical register file during decode, no register values read
§ Functional units read and write from single unified register file holding committed and temporary registers in execute
§ Commit only updates mapping of architectural register to physical register, no data movement
[Figure: the Decode Stage Register Mapping and the Committed Register Mapping both point into the Unified Physical Register File; Functional Units read operands at issue and write results at completion.]
Lifetime of Physical Registers
• Physical regfile holds committed and speculative values
• Physical registers decoupled from ROB entries (no data in ROB)

  Before rename            After rename
  ld   x1, (x3)            ld   P1, (Px)
  addi x3, x1, #4          addi P2, P1, #4
  sub  x6, x7, x9          sub  P3, Py, Pz
  add  x3, x3, x6          add  P4, P2, P3
  ld   x6, (x1)            ld   P5, (P1)
  add  x6, x6, x3          add  P6, P5, P4
  sd   x6, (x1)            sd   P6, (P1)
  ld   x6, (x11)           ld   P7, (Pw)

When can we reuse a physical register?
When next writer of same architectural register commits
Physical Register Management
ROB entry fields: ex | use | op | p1 PR1 | p2 PR2 | Rd | LPRd | PRd
(LPRd requires third read port on Rename Table for each instruction)

Instruction sequence:
  ld   x1, 0(x3)
  addi x3, x1, #4
  sub  x6, x7, x6
  add  x3, x3, x6
  ld   x6, 0(x1)

Initial state:
  Rename Table:   x1 → P8, x3 → P7, x6 → P5, x7 → P6
  Physical Regs:  P5 = <x6>, P6 = <x7>, P7 = <x3>, P8 = <x1> (all with p set)
  Free List:      P0, P1, P3, P2, P4

ROB contents as each instruction is renamed in turn (p marks a source that is already present):
  op     p1 PR1    p2 PR2    Rd    LPRd    PRd
  ld     p  P7               x1    P8      P0
  addi      P0               x3    P7      P1
  sub    p  P6     p  P5     x6    P5      P3
  add       P1        P3     x3    P1      P2
  ld        P0               x6    P3      P4

Rename Table after all five instructions: x1 → P0, x3 → P2, x6 → P4, x7 → P6

Execute & Commit:
  - ld executes and writes <x1> into P0 (p set); when it commits, P8 (its LPRd) returns to the free list
  - addi executes and writes <x3> into P1 (p set); when it commits, P7 (its LPRd) returns to the free list
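The renaming walk-through above (rename table x1 → P8, x3 → P7, x6 → P5, x7 → P6 and free list P0, P1, P3, P2, P4) can be replayed with a short sketch. The function names and tuple layout are assumptions made for illustration, and present bits are left out.

```python
from collections import deque

def rename(program, rename_table, free_list):
    """Rename (op, rd, sources) tuples; return ROB rows (op, PRs, Rd, LPRd, PRd)."""
    table, free, rob = dict(rename_table), deque(free_list), []
    for op, rd, srcs in program:
        psrcs = [table[s] for s in srcs]   # read sources before remapping rd
        lprd = table[rd]                   # previous mapping, freed at commit
        prd = free.popleft()               # allocate a new physical register
        table[rd] = prd
        rob.append((op, psrcs, rd, lprd, prd))
    return rob, table, free

def commit(rob, free):
    """Commit the oldest ROB row; its LPRd can no longer be read, so free it."""
    _op, _psrcs, _rd, lprd, _prd = rob.pop(0)
    free.append(lprd)
    return lprd

program = [
    ("ld",   "x1", ["x3"]),
    ("addi", "x3", ["x1"]),
    ("sub",  "x6", ["x7", "x6"]),
    ("add",  "x3", ["x3", "x6"]),
    ("ld",   "x6", ["x1"]),
]
table0 = {"x1": "P8", "x3": "P7", "x6": "P5", "x7": "P6"}
rob, table, free = rename(program, table0, ["P0", "P1", "P3", "P2", "P4"])
```

Calling `commit(rob, free)` once returns P8, matching the first "Execute & Commit" step in the slides: the register is released only when the next writer of x1 commits.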
MIPS R10K Trap Handling
§ Rename table is repaired by unrenaming instructions in reverse order using the PRd/LPRd fields
§ The Alpha 21264 had similar physical register file scheme, but kept complete rename table snapshots for each instruction in ROB (80 snapshots total)
  - Flash copy all bits from snapshot to active table in one cycle
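Reverse-order unrenaming can be sketched as follows. It is a simplified model rather than the R10K implementation: it assumes every ROB row is squashed, each row is reduced to a hypothetical (Rd, LPRd, PRd) tuple, and the data reuse the five-instruction example from the physical register management slides.

```python
def unrename(rename_table, free_list, rob):
    """Undo renames youngest-first using each ROB row's (Rd, LPRd, PRd)."""
    table, free = dict(rename_table), list(free_list)
    for rd, lprd, prd in reversed(rob):
        assert table[rd] == prd     # the row being undone is the current writer
        table[rd] = lprd            # restore the previous mapping
        free.insert(0, prd)         # the squashed destination is free again
    return table, free

# (Rd, LPRd, PRd) for ld, addi, sub, add, ld in program order.
rob = [("x1", "P8", "P0"), ("x3", "P7", "P1"), ("x6", "P5", "P3"),
       ("x3", "P1", "P2"), ("x6", "P3", "P4")]
speculative = {"x1": "P0", "x3": "P2", "x6": "P4", "x7": "P6"}
table, free = unrename(speculative, [], rob)
```

Walking youngest-first matters because x3 and x6 are each renamed twice; undoing in program order would leave an intermediate mapping behind.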
Part II: Advanced Out-of-Order Superscalar Designs
Reorder Buffer Holds Active Instructions (Decoded but not Committed)
[Figure: ROB contents at cycle t and at cycle t+1 over the same instruction stream, older instructions at the top and newer instructions at the bottom; the Commit, Execute, and Fetch positions move down the stream between the two cycles.]
  …
  ld  x1, (x3)
  add x3, x1, x2
  sub x6, x7, x9
  add x3, x3, x6
  ld  x6, (x1)
  add x6, x6, x3
  sd  x6, (x1)
  ld  x6, (x1)
  …
Separate Issue Window from ROB
The issue window holds only instructions that have been decoded and renamed but not issued into execution. Has register tags and presence bits, and pointer to ROB entry.
  Issue window entry fields: use | ex | op | p1 PR1 | p2 PR2 | PRd | ROB#
Reorder buffer used to hold exception information for commit.
  ROB entry fields: Done? | Rd | LPRd | PC | Except? (with Oldest and Free pointers)
ROB is usually several times larger than issue window. Why?
Superscalar Register Renaming
§ During decode, instructions allocated new physical destination register
§ Source operands renamed to physical register with newest value
§ Execution unit only sees physical register numbers
[Figure: two instructions (Inst 1, Inst 2), each with Op, Dest, Src1, Src2 fields, drive the Rename Table read-address and write ports; the read data and the Register Free List produce Op, PDest, PSrc1, PSrc2 for each instruction, and the table mapping is updated.]
Does this work?
Superscalar Register Renaming
[Figure: same structure as the previous slide, with added comparators (=?) between the destination of Inst 1 and the sources of Inst 2.]
Must check for RAW hazards between instructions issuing in same cycle. Can be done in parallel with rename lookup.
MIPS R10K renames 4 serially-RAW-dependent insts/cycle
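A minimal sketch of the same-cycle RAW check, assuming instructions are written as (dest, [sources]) with hypothetical register names: all sources are first looked up in the old rename table, as the parallel read ports would do, and the comparators then override any source that matches the destination of an older instruction in the same group.

```python
def rename_group(group, rename_table, free_list):
    """Rename a group of (dest, [srcs]) instructions decoded in the same cycle."""
    pdests = free_list[:len(group)]              # one new register per instruction
    renamed = []
    for i, (_dest, srcs) in enumerate(group):
        psrcs = [rename_table[s] for s in srcs]  # table lookup sees old mappings
        for j in range(i):                       # comparators vs. older dests
            older_dest = group[j][0]
            psrcs = [pdests[j] if s == older_dest else p
                     for s, p in zip(srcs, psrcs)]
        renamed.append((pdests[i], psrcs))
    new_table = dict(rename_table)
    for (dest, _srcs), pd in zip(group, pdests): # youngest writer wins
        new_table[dest] = pd
    return renamed, new_table

table = {"x1": "P1", "x2": "P2", "x3": "P3"}
group = [("x1", ["x2", "x3"]),      # x1 <- x2 op x3
         ("x3", ["x1", "x2"])]      # x3 <- x1 op x2, RAW on x1 within the group
renamed, new_table = rename_group(group, table, ["P10", "P11"])
```

Without the comparator loop the second instruction would read the stale mapping P1 for x1 instead of P10, which is why the table lookup alone does not work.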
Control Flow Penalty
[Figure: pipeline PC → Fetch (I-cache) → Fetch Buffer → Decode → Issue Buffer → Execute (Func. Units) → Result Buffer → Commit → Arch. State, marking the distance between "Next fetch started" and "Branch executed".]
Modern processors may have >10 pipeline stages between next PC calculation and branch resolution!
How much work is lost if pipeline doesn’t follow correct instruction flow?
~ Loop length x pipeline width + buffers
Reducing Control Flow Penalty
§ Software solutions
  - Eliminate branches: loop unrolling
    - Increases the run length
  - Reduce resolution time: instruction scheduling
    - Compute the branch condition as early as possible (of limited value because branches often in critical path through code)
§ Hardware solutions
  - Find something else to do: delay slots
    - Replaces pipeline bubbles with useful work (requires software cooperation); quickly see diminishing returns
  - Speculate: branch prediction
    - Speculative execution of instructions beyond the branch
    - Many advances in accuracy
Branch Prediction
Motivation:
  Branch penalties limit performance of deeply pipelined processors
  Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
  Prediction structures:
  • Branch history tables, branch target buffers, etc.
  Mispredict recovery mechanisms:
  • Keep result computation separate from commit
  • Kill instructions following branch in pipeline
  • Restore state to that following branch
Importance of Branch Prediction
§ Consider 4-way superscalar with 8 pipeline stages from fetch to dispatch, and 80-entry ROB, and 3 cycles from issue to branch resolution
§ On a mispredict, could throw away 8*4 + (80-1) = 111 instructions
§ Improving from 90% to 95% prediction accuracy removes 50% of branch mispredicts
  - If 1/6 instructions are branches, then move from 60 instructions between mispredicts, to 120 instructions between mispredicts
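The arithmetic on this slide can be checked directly. The variable and function names below are illustrative, and exact fractions are used so the results come out as whole numbers.

```python
from fractions import Fraction

width, front_end_stages, rob_entries = 4, 8, 80

# Instructions in flight behind a mispredicted branch: the front-end
# stages are full, and so is the ROB apart from the branch itself.
lost = front_end_stages * width + (rob_entries - 1)

def insts_between_mispredicts(accuracy, branch_fraction):
    """Average instruction count between two mispredicted branches."""
    return 1 / (branch_fraction * (1 - accuracy))

at_90 = insts_between_mispredicts(Fraction(90, 100), Fraction(1, 6))
at_95 = insts_between_mispredicts(Fraction(95, 100), Fraction(1, 6))
```

Here `lost` is 111, `at_90` is 60, and `at_95` is 120: halving the mispredict rate doubles the useful run length between pipeline flushes.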
Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:
  backward: 90%
  forward: 50%
ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110
  bne0 (preferred taken)   beq0 (not taken)
ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64
  typically reported as ~80% accurate
Dynamic Branch Prediction: learning based on past behavior
§ Temporal correlation
  - The way a branch resolves may be a good predictor of the way it will resolve at the next execution
§ Spatial correlation
  - Several branches may resolve in a highly correlated manner (a preferred path of execution)
One-Bit Branch History Predictor
§ For each branch, remember last way branch went
§ Has problem with loop-closing backward branches, as two mispredicts occur on every loop execution
  1. first iteration predicts loop backwards branch not-taken (loop was exited last time)
  2. last iteration predicts loop backwards branch taken (loop continued last time)
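A small simulation makes the two-mispredicts-per-loop behaviour concrete. The function name and the 10-iteration loop are assumptions chosen for the example.

```python
def one_bit_mispredicts(outcomes, initial=False):
    """Count mispredicts of a predictor that repeats the branch's last outcome."""
    last, misses = initial, 0
    for taken in outcomes:
        misses += (last != taken)   # prediction is simply the previous outcome
        last = taken
    return misses

# Loop-closing backward branch: taken 9 times, then not taken on loop exit.
loop_run = [True] * 9 + [False]
misses = one_bit_mispredicts(loop_run * 3)   # the loop is executed 3 times
```

Each execution of the loop costs two mispredicts, one on the first iteration and one on the exit, so three executions give six.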
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[Figure: four-state diagram with states "take right", "take wrong", "¬take wrong", "¬take right" and transitions labelled taken / ¬taken.]
BP state: (predict take/¬take) x (last prediction right/wrong)
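The two-bit scheme can be sketched with the state held as the pair (predict take?, last prediction right?). One detail is an assumption here, since the state diagram did not survive extraction: after the second consecutive mistake the predictor flips and enters the "right" state of the new direction.

```python
def two_bit_mispredicts(outcomes, predict_taken=False, last_right=True):
    """State = (predict_taken, last_right); flip after two consecutive mistakes."""
    misses = 0
    for taken in outcomes:
        if taken == predict_taken:
            last_right = True
        else:
            misses += 1
            if last_right:
                last_right = False                    # first mistake: keep prediction
            else:
                predict_taken, last_right = taken, True   # second in a row: flip
    return misses

# Same loop-closing branch as for the one-bit predictor: 9 taken, then exit.
loop_run = [True] * 9 + [False]
warm = two_bit_mispredicts(loop_run * 3, predict_taken=True)
cold = two_bit_mispredicts(loop_run * 3)
```

Once trained, the predictor misses only the loop exit, so `warm` is 3 for three loop executions; starting from "predict not taken" adds two training misses, giving 5.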
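Branch History Table (BHT)
[Figure: k bits of the fetch PC (above the two low-order 0 bits) form the BHT index into a 2^k-entry BHT, 2 bits/entry, producing Taken/¬Taken?; in parallel the I-cache supplies the instruction (opcode, offset), which determines Branch? and, via PC + offset, the Target PC.]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions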
Exploiting Spatial Correlation (Yeh and Patt, 1992)
  if (x[i] < 7) then
      y += 1;
  if (x[i] < 5) then
      c -= 4;
If first condition false, second condition also false
History register, H, records the direction of the last N branches executed by the processor
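Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)
[Figure: k bits of the fetch PC (above the two low-order 0 bits) index the BHT; a 2-bit global branch history shift register, which shifts in the Taken/¬Taken results of each branch, selects one of the four sets; the output is Taken/¬Taken?]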
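A toy two-level predictor along these lines is sketched below. It is not the Pentium Pro design: the table size, the PC indexing, and the use of 2-bit saturating counters in each BHT set are all assumptions for illustration.

```python
class TwoLevelPredictor:
    def __init__(self, index_bits=4, history_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0            # global shift register of recent outcomes
        # One set of 2-bit counters per history pattern, initially weak not-taken.
        self.tables = [[1] * (1 << index_bits)
                       for _ in range(1 << history_bits)]

    def _index(self, pc):
        return (pc >> 2) & self.index_mask

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        row, i = self.tables[self.history], self._index(pc)
        row[i] = min(3, row[i] + 1) if taken else max(0, row[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask

def count_mispredicts(pred, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += pred.predict(pc) != taken
        pred.update(pc, taken)
    return misses

# A branch that strictly alternates taken / not taken.
misses = count_mispredicts(TwoLevelPredictor(), 0x40, [True, False] * 20)
```

The alternating branch costs only two training mispredicts, because each history pattern selects its own counter; a single counter without history would keep mispredicting this pattern.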
Speculating Both Directions
§ An alternative to branch prediction is to execute both directions of a branch speculatively
  - resource requirement is proportional to the number of concurrent speculative executions
  - only half the resources engage in useful work when both directions of a branch are executed speculatively
  - branch prediction takes less resources than speculative execution of both paths
§ With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction!
Limitations of BHTs
Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.
UltraSPARC-III fetch pipeline:
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional units
  R  Register File Read
  E  Integer Execute
     Remainder of execute pipeline (+ another 6 stages)
[Figure annotations: "Correctly predicted taken branch penalty" and "Jump Register penalty" mark spans of these stages.]
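Branch Target Buffer (BTB)
[Figure: k bits of the I-cache PC index a 2^k-entry direct-mapped BTB (can also be associative); each entry holds a valid bit, Entry PC, and predicted target PC; comparing Entry PC with the PC (=) gives match, and valid and match select the target.]
• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded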
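A direct-mapped BTB following the four rules above might look like this. It is a sketch only: the class and method names, the 16-entry size, and the choice to evict an entry when its branch resolves not-taken are assumptions.

```python
class BranchTargetBuffer:
    def __init__(self, k=4):
        self.k = k
        self.entries = [None] * (1 << k)   # (entry_pc, target_pc), or None if invalid

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)

    def next_pc(self, pc):
        """Predict the next fetch PC from the current PC alone (no decode)."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:   # valid and Entry PC matches
            return entry[1]
        return pc + 4                              # match fails: fall through

    def update(self, pc, taken, target):
        """Called when the branch resolves; only taken branches are kept."""
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None

btb = BranchTargetBuffer()
cold = btb.next_pc(0x100)        # nothing learned yet
btb.update(0x100, True, 0x80)
hit = btb.next_pc(0x100)         # redirected to the stored target
alias = btb.next_pc(0x140)       # same index, different Entry PC
```

Storing the full branch PC is what keeps the aliasing fetch at 0x140 from being redirected; a BHT, which has no tags, cannot make that distinction.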
Combining BTB and BHT
§ BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)
§ BHT can hold many more entries and is more accurate
[Figure: UltraSPARC-III fetch pipeline, stages A (PC Generation/Mux) through E (Integer Execute), with the BTB consulted at an early stage and the BHT at a later stage.]
BHT in later pipeline stage corrects when BTB misses a predicted taken branch
BTB/BHT only updated after branch resolves in E stage
Uses of Jump Register (JR)
How well does BTB work for each of these cases?
§ Switch statements (jump to address of matching case)
  BTB works well if same case used repeatedly
§ Dynamic function call (jump to run-time function address)
  BTB works well if same function usually called (e.g., in C++ programming, when objects have same type in virtual function call)
§ Subroutine returns (jump to return address)
  BTB works well if usually return to the same place
  ⇒ Often one function called from many distinct call sites!
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
  fa() { fb(); }
  fb() { fc(); }
  fc() { fd(); }
Push call address when function call executed
Pop return address when subroutine return decoded
[Figure: stack holding &fb(), &fc(), &fd(); k entries (typically k = 8-16)]
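A return stack is small enough to model in a few lines. In this sketch the pushed value is the address following the call, and the overflow policy (drop the oldest entry) is an assumption; real designs differ.

```python
class ReturnStack:
    def __init__(self, k=8):
        self.k, self.stack = k, []

    def call(self, return_address):
        """Push on a function call; on overflow the oldest entry is lost."""
        if len(self.stack) == self.k:
            self.stack.pop(0)
        self.stack.append(return_address)

    def ret(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

rs = ReturnStack(k=2)
for addr in (0x104, 0x204, 0x304):   # three nested calls into a 2-entry stack
    rs.call(addr)
predictions = [rs.ret(), rs.ret(), rs.ret()]
```

The two innermost returns are predicted correctly, and the third has no prediction because its entry was pushed out, which is why k is sized to typical call depth.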
Return Stack in Pipeline
§ How to use return stack (RS) in deep fetch pipeline?
§ Only know if subroutine call/return at decode
[Figure: UltraSPARC-III fetch pipeline, stages A (PC Generation/Mux) through E (Integer Execute), with the RS accessed after decode and the return stack prediction checked later in the pipeline.]
RS Push/Pop after decode gives large bubble in fetch stream.
Return Stack in Pipeline
§ Can remember whether PC is subroutine call/return using BTB-like structure
§ Instead of target-PC, just store push/pop bit
[Figure: UltraSPARC-III fetch pipeline, stages A (PC Generation/Mux) through E (Integer Execute), with the RS accessed at the start of the pipeline and the return stack prediction checked later.]
Push/Pop before instructions decoded!
In-Order vs. Out-of-Order Branch Prediction
In-Order Issue (pipeline: Fetch → Decode → Execute → Commit, with Br. Pred. at fetch and Resolve at execute)
§ Speculative fetch but not speculative execution: branch resolves before later instructions complete
§ Completed values held in bypass network until commit
Out-of-Order Issue (pipeline: Fetch → Decode → Execute → Commit with ROB; fetch, decode, and commit in-order, execute out-of-order; Br. Pred. at fetch and Resolve at execute)
§ Speculative execution, with branches resolved after later instructions complete
§ Completed values held in rename registers in ROB or unified physical register file until commit
• Both styles of machine can use same branch predictors in front-end fetch pipeline, and both can execute multiple instructions per cycle
• Common to have 10-30 pipeline stages in either style of design
InO vs. OoO Mispredict Recovery
§ In-order execution?
  - Design so no instruction issued after branch can write-back before branch resolves
  - Kill all instructions in pipeline behind mispredicted branch
§ Out-of-order execution?
  - Multiple instructions following branch in program order can complete before branch resolves
  - A simple solution would be to handle like precise traps
  - Problem?
Branch Misprediction in Pipeline
§ Can have multiple unresolved branches in ROB
§ Can resolve branches out-of-order by killing all the instructions in ROB that follow a mispredicted branch
§ MIPS R10K uses four mask bits to tag instructions that are dependent on up to four speculative branches
§ Mask bits cleared as branch resolves, and reused for next branch
[Figure: PC → Fetch (with Branch Prediction) → Decode → Reorder Buffer → Commit, with Execute and Complete attached to the reorder buffer; Branch Resolution sends Kill signals to the earlier stages and injects the correct PC.]
Rename Table Recovery
§ Have to quickly recover rename table on branch mispredicts
§ MIPS R10K only has four snapshots for each of four outstanding speculative branches
§ Alpha 21264 has 80 snapshots, one per ROB instruction
Improving Instruction Fetch
§ Performance of speculative out-of-order machines often limited by instruction fetch bandwidth
  - speculative execution can fetch 2-3x more instructions than are committed
  - mispredict penalties dominated by time to refill instruction window
  - taken branches are particularly troublesome
Increasing Taken Branch Bandwidth (Alpha 21264 I-Cache)
[Figure: PC Generation feeds the cached instructions, whose Line Predict and Way Predict fields form the fast fetch path (4 insts); the Tag Way 0 / Tag Way 1 comparisons (=?), branch prediction, instruction decode, and validity checks produce Hit/Miss/Way off the critical loop.]
§ Fold 2-way tags and BTB into predicted next block
§ Take tag checks, inst. decode, branch predict out of loop
§ Raw RAM speed on critical loop (1 cycle at ~1 GHz)
§ 2-bit hysteresis counter per block prevents overtraining
Tournament Branch Predictor (Alpha 21264)
§ Choice predictor learns whether best to use local or global branch history in predicting next branch
§ Global history is speculatively updated but restored on mispredict
§ Claim 90-100% success on range of applications
[Figure: PC indexes the Local history table (1,024 x 10b), which feeds Local prediction (1,024 x 3b); Global History (12b) feeds Global Prediction (4,096 x 2b) and Choice Prediction (4,096 x 2b); the choice selects the final Prediction.]
Taken Branch Limit
§ Integer codes have a taken branch every 6-9 instructions
§ To avoid fetch bottleneck, must execute multiple taken branches per cycle when increasing performance
§ This implies:
  - predicting multiple branches per cycle
  - fetching multiple non-contiguous blocks per cycle
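Branch Address Cache (Yeh, Marr, Patt)
[Figure: k bits of the PC index the table; each entry holds valid, Entry PC, predicted target #1 with len #1, and predicted target #2; comparing Entry PC with the PC (=) gives match.]
Extend BTB to return multiple branch predictions per cycle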
Fetching Multiple Basic Blocks
§ Requires either
  - multiported cache: expensive
  - interleaving: bank conflicts will occur
§ Merging multiple blocks to feed to decoders adds latency, increasing mispredict penalty and reducing branch throughput
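Trace Cache
§ Key Idea: Pack multiple non-contiguous basic blocks into one contiguous trace cache line
[Figure: basic blocks ending in BR, scattered through the instruction stream, are packed side by side into a single trace cache line.]
• Single fetch brings in multiple basic blocks
• Trace cache indexed by start address and next n branch predictions
• Used in Intel Pentium-4 processor to hold decoded uops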
Load-Store Queue Design
§ After control hazards, data hazards through memory are probably next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be speculatively issued
Speculative Store Buffer
§ Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, store buffer slot allocated in program order
§ Stores split into “store address” and “store data” micro-operations
§ “Store address” execution writes tag
§ “Store data” execution writes data
§ Store commits when oldest instruction and both address and data available:
  - clear speculative bit and eventually move data to cache
§ On store abort:
  - clear valid bit
[Figure: speculative store buffer entries, each with V (valid) and S (speculative) bits, Tag, and Data; Store Address and Store Data write the entries, and the Store Commit Path moves data into the L1 Data Cache (Tags, Data).]
Load bypass from speculative store buffer
§ If data in both store buffer and cache, which should we use?
  Speculative store buffer
§ If same address in store buffer twice, which should we use?
  Youngest store older than load
[Figure: the Load Address is compared against the tags of both the Speculative Store Buffer (entries with V, S, Tag, Data) and the L1 Data Cache (Tags, Data); the selected source supplies Load Data.]
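The two selection rules can be expressed as a short lookup. This is a sketch under assumptions: entries carry a hypothetical program-order sequence number `seq`, addresses are compared in full, and the cache is modelled as a dictionary.

```python
def load_value(load_seq, addr, store_buffer, cache):
    """Return the value a load should see: youngest older matching store, else cache."""
    older = [s for s in store_buffer
             if s["valid"] and s["seq"] < load_seq and s["addr"] == addr]
    if older:
        # Store buffer beats the cache; among matches the youngest one wins.
        return max(older, key=lambda s: s["seq"])["data"]
    return cache[addr]

store_buffer = [
    {"seq": 1, "addr": 0x100, "data": 11, "valid": True},
    {"seq": 3, "addr": 0x100, "data": 33, "valid": True},
    {"seq": 7, "addr": 0x100, "data": 77, "valid": True},   # younger than the load
]
cache = {0x100: 5, 0x200: 9}
bypassed = load_value(5, 0x100, store_buffer, cache)
from_cache = load_value(5, 0x200, store_buffer, cache)
```

The load with sequence number 5 receives 33: the store at sequence 7 is ignored because it follows the load in program order, and the store at sequence 1 is superseded.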
Memory Dependencies
  sd x1, (x2)
  ld x3, (x4)
§ When can we execute the load?
In-Order Memory Queue
§ Execute all loads and stores in program order
§ => Load and store cannot leave ROB for execution until all previous loads and stores have completed execution
§ Can still execute loads and stores speculatively, and out-of-order with respect to other instructions
§ Need a structure to handle memory ordering…
Conservative O-o-O Load Execution
  sd x1, (x2)
  ld x3, (x4)
§ Can execute load before store, if addresses known and x4 != x2
§ Each load address compared with addresses of all previous uncommitted stores
  - can use partial conservative check, i.e., bottom 12 bits of address, to save hardware
§ Don’t execute load if any previous store address not known
§ (MIPS R10K, 16-entry address queue)
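The conservative issue rule, including the partial 12-bit comparison, can be sketched as a predicate. The function name and the use of `None` for a store whose address is not yet known are assumptions for the example.

```python
def load_may_issue(load_addr, older_store_addrs, check_bits=12):
    """True if the load can safely execute ahead of all older uncommitted stores."""
    mask = (1 << check_bits) - 1
    for store_addr in older_store_addrs:
        if store_addr is None:                       # address not yet computed
            return False
        if (store_addr & mask) == (load_addr & mask):
            return False                             # possible overlap: wait
    return True

independent = load_may_issue(0x1008, [0x2000, 0x3010])
unknown = load_may_issue(0x1008, [0x2000, None])
same_addr = load_may_issue(0x1008, [0x1008])
false_conflict = load_may_issue(0x1008, [0x5008])    # differs only above bit 11
```

The last case shows the cost of the partial check: the addresses differ, but the low 12 bits match, so the load waits unnecessarily. The check never lets a truly conflicting load through, which is what makes it conservative.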
Address Speculation
  sd x1, (x2)
  ld x3, (x4)
§ Guess that x4 != x2
§ Execute load before store address known
§ Need to hold all completed but uncommitted load/store addresses in program order
§ If subsequently find x4 == x2, squash load and all following instructions
§ => Large penalty for inaccurate address speculation
Memory Dependence Prediction (Alpha 21264)
  sd x1, (x2)
  ld x3, (x4)
§ Guess that x4 != x2 and execute load before store
§ If later find x4 == x2, squash load and all following instructions, but mark load instruction as store-wait
§ Subsequent executions of the same load instruction will wait for all previous stores to complete
§ Periodically clear store-wait bits
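The store-wait mechanism amounts to one bit per load instruction. The sketch below keeps those bits in a set keyed by load PC; the class and method names are hypothetical, and the 21264's actual table organisation is not modelled.

```python
class StoreWaitPredictor:
    def __init__(self):
        self.store_wait = set()          # PCs of loads marked store-wait

    def may_speculate(self, load_pc):
        """A load may run ahead of older stores unless it is marked store-wait."""
        return load_pc not in self.store_wait

    def on_violation(self, load_pc):
        """Called when a speculated load turns out to alias an older store."""
        self.store_wait.add(load_pc)

    def periodic_clear(self):
        """Forget old marks so loads that no longer conflict speculate again."""
        self.store_wait.clear()

pred = StoreWaitPredictor()
first = pred.may_speculate(0x400)        # no history yet: speculate
pred.on_violation(0x400)                 # x4 == x2 was discovered, load squashed
second = pred.may_speculate(0x400)       # now waits for previous stores
pred.periodic_clear()
third = pred.may_speculate(0x400)        # mark cleared, speculates again
```

Clearing periodically trades an occasional repeated squash for not penalising a load forever after a single conflict.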
Part III: More Advanced Out-of-Order Superscalar Designs
Limitations of BHTs
Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.
[Figure: UltraSPARC-III fetch pipeline, marking the correctly predicted taken branch penalty and the jump register penalty]
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
Remainder of execute pipeline (+ another 6 stages)
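Branch Target Buffer (BTB)
• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
[Figure: 2^k-entry direct-mapped BTB (can also be associative), indexed by k bits of the same PC that goes to the I-Cache. Each entry holds a valid bit, the entry PC, and the predicted target PC. If the entry is valid and its entry PC equals the fetch PC (match), the predicted target is used as the next PC.]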
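A minimal sketch of a direct-mapped BTB lookup following the bullets above. The class and method names, the default k, and the choice to drop the low two PC bits when indexing (4-byte instructions) are assumptions made for illustration.

```python
class BTB:
    """Direct-mapped branch target buffer with 2**k entries (a sketch)."""

    def __init__(self, k=4):
        self.k = k
        # Each entry is (entry_pc, target_pc); None models a cleared valid bit.
        self.entries = [None] * (1 << k)

    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)

    def predict_next_pc(self, pc):
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]   # match: redirect fetch to the predicted target
        return pc + 4         # match fails: fetch PC+4

    def update(self, pc, taken, target):
        """Called after the branch resolves; only taken branches are kept."""
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None
```

Because the full entry PC is compared, two branches that map to the same index cannot use each other's targets; the second one simply misses and falls through to PC+4.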
Combining BTB and BHT
§ BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
§ BHT can hold many more entries and is more accurate
[Figure: UltraSPARC-III fetch pipeline (stages A, P, F, B, I, J, R, E) annotated with where the BTB and the BHT are accessed]
BHT in later pipeline stage corrects when BTB misses a predicted taken branch
BTB/BHT only updated after branch resolves in E stage
Uses of Jump Register (JR)
§ Switch statements (jump to address of matching case)
§ Dynamic function call (jump to run-time function address)
§ Subroutine returns (jump to return address)
How well does BTB work for each of these cases?
- Switch statements: BTB works well if same case used repeatedly
- Dynamic function calls: BTB works well if same function usually called (e.g., in C++ programming, when objects have same type in virtual function call)
- Subroutine returns: BTB works well if usually return to the same place => Often one function called from many distinct call sites!
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
Push call address when function call executed
Pop return address when subroutine return decoded
[Figure: return stack of k entries (typically k = 8-16) holding &fb(), &fc(), and &fd()]
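The push and pop behaviour can be sketched as a bounded stack. The names are hypothetical, and two details are assumptions of this sketch rather than facts from the slide: the pushed value is taken to be the call PC plus 4 (the instruction after the call), and on overflow the oldest entry is discarded.

```python
class ReturnStack:
    """k-entry return address stack used to predict JR targets for returns."""

    def __init__(self, k=8):
        self.k = k
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.k:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(call_pc + 4)   # assumes 4-byte instructions

    def on_return(self):
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```

Unlike a BTB entry, the prediction depends on the most recent unmatched call, so a function called from many distinct call sites is still predicted correctly as long as the call depth stays within k.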
Return Stack in Pipeline
§ How to use return stack (RS) in deep fetch pipeline?
§ Only know if subroutine call/return at decode
[Figure: UltraSPARC-III fetch pipeline (stages A, P, F, B, I, J, R, E) with the RS accessed after decode, and the later point where the return stack prediction is checked]
RS Push/Pop after decode gives large bubble in fetch stream.
Return Stack in Pipeline
§ Can remember whether PC is subroutine call/return using BTB-like structure
§ Instead of target-PC, just store push/pop bit
[Figure: UltraSPARC-III fetch pipeline (stages A, P, F, B, I, J, R, E) with the RS accessed early in fetch, and the later point where the return stack prediction is checked]
Push/Pop before instructions decoded!
In-Order vs. Out-of-Order Branch Prediction
In-Order Issue
[Figure: Fetch -> Decode -> Execute -> Commit, all in-order; branch predicted (Br. Pred.) at fetch and resolved at execute]
§ Speculative fetch but not speculative execution
- branch resolves before later instructions complete
§ Completed values held in bypass network until commit
Out-of-Order Issue
[Figure: Fetch -> Decode (in-order) -> Execute (out-of-order) -> Commit (in-order), with a ROB; branch predicted (Br. Pred.) at fetch and resolved at execute]
§ Speculative execution, with branches resolved after later instructions complete
§ Completed values held in rename registers in ROB or unified physical register file until commit
• Both styles of machine can use same branch predictors in front-end fetch pipeline, and both can execute multiple instructions per cycle
• Common to have 10-30 pipeline stages in either style of design
InO vs. OoO Mispredict Recovery
§ In-order execution?
- Design so no instruction issued after branch can write back before branch resolves
- Kill all instructions in pipeline behind mispredicted branch
§ Out-of-order execution?
- Multiple instructions following branch in program order can complete before branch resolves
- A simple solution would be to handle like precise traps
- Problem?
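Branch Misprediction in Pipeline
§ Can have multiple unresolved branches in ROB
§ Can resolve branches out-of-order by killing all the instructions in ROB that follow a mispredicted branch
§ MIPS R10K uses four mask bits to tag instructions that are dependent on up to four speculative branches
§ Mask bits cleared as branch resolves, and reused for next branch
[Figure: PC -> Fetch -> Decode -> Reorder Buffer -> Commit, with Execute fed from the reorder buffer and completing back into it. Branch prediction steers the PC; on a misprediction, branch resolution injects the correct PC and kills the following instructions in fetch, decode, and the reorder buffer.]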
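The mask-bit idea can be sketched as follows. This is an illustrative model in the spirit of the four-mask-bit scheme described above, not a description of the actual R10K logic; `Inst`, `BranchMasks`, and their fields are hypothetical names, and commit (removal of finished instructions from the window) is left out.

```python
from dataclasses import dataclass
from typing import Optional

NUM_MASK_BITS = 4   # up to four unresolved speculative branches

@dataclass
class Inst:
    name: str
    mask: int = 0               # bit i set: dispatched after unresolved branch i
    tag: Optional[int] = None   # mask bit owned by this instruction, if a branch
    killed: bool = False

class BranchMasks:
    def __init__(self):
        self.live = 0      # mask bits currently owned by unresolved branches
        self.window = []   # in-flight instructions, oldest first

    def dispatch(self, name, is_branch=False):
        tag = None
        if is_branch:
            free = [i for i in range(NUM_MASK_BITS) if not (self.live >> i) & 1]
            if not free:
                raise RuntimeError("all mask bits in use: dispatch must stall")
            tag = free[0]
        # The instruction depends on every branch that is still unresolved.
        inst = Inst(name, mask=self.live, tag=tag)
        self.window.append(inst)
        if tag is not None:
            self.live |= 1 << tag
        return inst

    def resolve(self, branch, mispredicted):
        bit = 1 << branch.tag
        if mispredicted:
            survivors = []
            for inst in self.window:
                if inst.mask & bit:
                    inst.killed = True               # younger than the branch
                    if inst.tag is not None:
                        self.live &= ~(1 << inst.tag)  # free killed branch's bit
                else:
                    survivors.append(inst)
            self.window = survivors
        else:
            for inst in self.window:
                inst.mask &= ~bit                    # dependence is now gone
        self.live &= ~bit                            # bit can be reused
        branch.tag = None
```

Because each instruction carries one bit per unresolved older branch, any branch can be resolved first: killing is a single bit test per instruction rather than a walk through program order.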
Rename Table Recovery
§ Have to quickly recover rename table on branch mispredicts
§ MIPS R10K only has four snapshots, one for each of four outstanding speculative branches
§ Alpha 21264 has 80 snapshots, one per ROB instruction
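Snapshot-based recovery can be sketched as copying the rename map at each speculative branch and restoring the copy on a mispredict. The class, its method names, and the use of increasing `branch_id` values to express program order are assumptions of this sketch; the default of four snapshots mirrors the R10K figure above.

```python
class RenameTable:
    """Rename map with per-branch snapshots (a sketch)."""

    def __init__(self, num_arch_regs=32, max_snapshots=4):
        self.map = list(range(num_arch_regs))  # arch register -> physical register
        self.max_snapshots = max_snapshots
        self.snapshots = {}                    # branch_id -> saved copy of map

    def rename_dest(self, arch_reg, phys_reg):
        self.map[arch_reg] = phys_reg

    def checkpoint(self, branch_id):
        """Save the map when a speculative branch is renamed."""
        if len(self.snapshots) >= self.max_snapshots:
            raise RuntimeError("no free snapshot: rename must stall")
        self.snapshots[branch_id] = list(self.map)

    def resolve(self, branch_id, mispredicted):
        saved = self.snapshots.pop(branch_id)
        if mispredicted:
            self.map = saved   # one-step recovery of the rename table
            # Snapshots of younger branches (larger ids) are now meaningless.
            for younger in [b for b in self.snapshots if b > branch_id]:
                del self.snapshots[younger]
```

The trade-off visible in the two bullets is the number of snapshots: with only a few, rename stalls when too many branches are outstanding; with one per in-flight instruction, recovery is possible at any instruction at a higher storage cost.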
Improving Instruction Fetch
§ Performance of speculative out-of-order machines often limited by instruction fetch bandwidth
- speculative execution can fetch 2-3x more instructions than are committed
- mispredict penalties dominated by time to refill instruction window
- taken branches are particularly troublesome
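Increasing Taken Branch Bandwidth (Alpha 21264 I-Cache)
§ Fold 2-way tags and BTB into predicted next block
§ Take tag checks, inst. decode, branch predict out of loop
§ Raw RAM speed on critical loop (1 cycle at ~1 GHz)
§ 2-bit hysteresis counter per block prevents overtraining
[Figure: each I-cache block holds cached instructions plus line predict and way predict fields. On the fast fetch path, PC generation and the line/way predictions select the next block of 4 insts. Off the fast path, the tags of way 0 and way 1 are compared (=?) to produce hit/miss/way, alongside branch prediction, instruction decode, and validity checks.]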
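Tournament Branch Predictor (Alpha 21264)
§ Choice predictor learns whether best to use local or global branch history in predicting next branch
§ Global history is speculatively updated but restored on mispredict
§ Claim 90-100% success on range of applications
[Figure: the PC indexes the local history table (1,024 x 10b), whose output indexes the local prediction table (1,024 x 3b). The global history (12b) indexes the global prediction table (4,096 x 2b) and the choice prediction table (4,096 x 2b). The choice prediction selects between the local and global predictions to produce the final prediction.]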
Taken Branch Limit
§ Integer codes have a taken branch every 6-9 instructions
§ To avoid fetch bottleneck, must execute multiple taken branches per cycle when increasing performance
§ This implies:
- predicting multiple branches per cycle
- fetching multiple non-contiguous blocks per cycle
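Branch Address Cache (Yeh, Marr, Patt)
[Figure: BTB-like structure indexed by k bits of the PC. Each entry holds a valid bit, the entry PC, predicted target #1 with its length (len #1), and predicted target #2. A valid entry whose entry PC equals the fetch PC signals a match and returns target #1, len #1, and target #2.]
Extend BTB to return multiple branch predictions per cycle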
Fetching Multiple Basic Blocks
§ Requires either
- multiported cache: expensive
- interleaving: bank conflicts will occur
§ Merging multiple blocks to feed to decoders adds latency, increasing mispredict penalty and reducing branch throughput
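Trace Cache
§ Key Idea: Pack multiple non-contiguous basic blocks into one contiguous trace cache line
[Figure: three basic blocks, each ending in a branch (BR), are non-contiguous in the instruction stream but stored back to back in one trace cache line]
• Single fetch brings in multiple basic blocks
• Trace cache indexed by start address and next n branch predictions
• Used in Intel Pentium 4 processor to hold decoded uops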
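The indexing rule (start address plus the next n branch predictions) can be sketched with a dictionary lookup. This ignores capacity, associativity, and partial matches, all of which a real trace cache has to handle; the class, its methods, and the instruction strings are purely illustrative.

```python
class TraceCache:
    """Trace cache keyed by (start PC, next n predicted branch outcomes)."""

    def __init__(self, n=2):
        self.n = n
        self.lines = {}

    def fill(self, start_pc, outcomes, insts):
        """Record the instructions seen along one path of n branch outcomes."""
        if len(outcomes) != self.n:
            raise ValueError("a trace is keyed by exactly n branch outcomes")
        self.lines[(start_pc, tuple(outcomes))] = list(insts)

    def fetch(self, start_pc, predictions):
        """Return the whole trace in one fetch, or None on a miss."""
        return self.lines.get((start_pc, tuple(predictions[:self.n])))
```

The same start address can hold different traces for different predicted paths, which is why the predictions are part of the key.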
Load-Store Queue Design
§ After control hazards, data hazards through memory are probably next most important bottleneck to superscalar performance
§ Modern superscalars use very sophisticated load-store reordering techniques to reduce effective memory latency by allowing loads to be speculatively issued
Speculative Store Buffer
§ Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.
§ During decode, store buffer slot allocated in program order
§ Stores split into “store address” and “store data” micro-operations
§ “Store address” execution writes tag
§ “Store data” execution writes data
§ Store commits when oldest instruction and both address and data available:
- clear speculative bit and eventually move data to cache
§ On store abort:
- clear valid bit
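The life cycle of a store buffer slot can be sketched as follows. The class and method names are hypothetical, a dictionary stands in for the cache, and the caller (the ROB) is assumed to call `commit` only when the store is the oldest instruction.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    valid: bool = True          # V bit: cleared on store abort
    speculative: bool = True    # S bit: cleared on store commit
    tag: Optional[int] = None   # written by the "store address" micro-op
    data: Optional[int] = None  # written by the "store data" micro-op

class SpeculativeStoreBuffer:
    def __init__(self):
        self.slots = []   # program order, oldest first
        self.cache = {}   # stand-in for the L1 data cache

    def allocate(self):
        """During decode: allocate a slot in program order."""
        slot = Slot()
        self.slots.append(slot)
        return slot

    def store_address(self, slot, addr):
        slot.tag = addr

    def store_data(self, slot, data):
        slot.data = data

    def commit(self, slot):
        """Commit needs both address and data; clears the speculative bit."""
        if slot.tag is None or slot.data is None:
            return False
        slot.speculative = False
        return True

    def abort(self, slot):
        slot.valid = False

    def drain(self):
        """Eventually move committed data to the cache; drop aborted slots."""
        while self.slots and (not self.slots[0].valid
                              or not self.slots[0].speculative):
            slot = self.slots.pop(0)
            if slot.valid:
                self.cache[slot.tag] = slot.data
```

Memory is only modified inside `drain`, and only for slots whose speculative bit has been cleared, so an aborted store never becomes visible in the cache.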
Acknowledgements
§ These course notes were developed by:
- Krste Asanovic (UCB)
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)