CS 61C: Great Ideas in Computer Architecture
Lecture 13: Pipelining
John Wawrzynek & Nick Weaver
http://inst.eecs.berkeley.edu/~cs61c/sp18

Agenda
• RISC-V Pipeline
• Pipeline Control
• Hazards
  − Structural
  − Data
    ▪ R-type instructions
    ▪ Load
  − Control
• Superscalar processors

Recap: Pipelining with RISC-V
[Figure: pipeline timing diagram for the instruction sequence add t0, t1, t2; or t3, t4, t5; sll t6, t0, t3, with t_cycle and t_instruction marked.]

                                  Single Cycle                 Pipelining
Timing                            t_step = 100…200 ps          t_cycle = 200 ps
                                  Register access only 100 ps  All cycles same length
Instruction time, t_instruction   = t_cycle = 800 ps           1000 ps
Clock rate, f_s                   1/800 ps = 1.25 GHz          1/200 ps = 5 GHz
Relative speed                    1x                           4x

RISC-V Pipeline
[Figure: pipelined execution of the instruction sequence add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3); addi t2, t2, 1, with t_cycle = 200 ps and t_instruction = 1000 ps. One axis shows the resource use of an instruction over time; the other shows the resource use in a particular time slot.]

Single-Cycle RISC-V RV32I Datapath
[Figure: single-cycle RV32I datapath with pc, +4 adder, IMEM, Reg[] (AddrA, AddrB, AddrD, DataA, DataB, DataD), Imm. Gen, Branch Comp., ALU, and DMEM (Addr, DataW, DataR). Control signals derived from inst[31:0]: PCSel, ImmSel, RegWEn, BrUn, BrEq, BrLT, ASel, BSel, ALUSel, MemRW, WBSel.]

Pipelining RISC-V RV32I Datapath
[Figure: the same RV32I datapath divided into five stages.]
• Instruction Fetch (F)
• Instruction Decode/Register Read (D)
• ALU Execute (X)
• Memory Access (M)
• Write Back (W)

Pipelined RISC-V RV32I Datapath
[Figure: pipelined RV32I datapath. Pipeline registers carry pcF, pcD, pcX, pcM, instD, instX, instM, instW, rs1X, rs2X, rs2M, immX, aluX, and aluM between stages.]
• Recalculate PC+4 in M stage to avoid sending both PC and PC+4 down pipeline
• Must pipeline instruction along with data, so control operates correctly in each stage

Each stage operates on different instruction
[Figure: the pipelined RV32I datapath with five instructions in flight: add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3).]
Pipeline registers separate stages, hold data for each instruction in flight
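
The idea that each stage holds a different instruction can be sketched in a few lines of Python. This is only an illustration under the assumption of an ideal five-stage pipeline with no hazards; the function name `pipeline_diagram` and the cycle numbering are made up for this example and are not from the lecture.

```python
# Minimal sketch of an ideal 5-stage pipeline (assumption: no hazards, no stalls).
STAGES = ["F", "D", "X", "M", "W"]

def pipeline_diagram(instructions):
    """Return one row per instruction: the stage it occupies in each cycle ('.' = not in flight)."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, _ in enumerate(instructions):
        row = []
        for cycle in range(total_cycles):
            stage_index = cycle - i  # instruction i enters F in cycle i
            row.append(STAGES[stage_index] if 0 <= stage_index < len(STAGES) else ".")
        rows.append(row)
    return rows

program = ["add t0, t1, t2", "or t3, t4, t5", "slt t6, t0, t3", "sw t0, 4(t3)", "lw t0, 8(t3)"]
diagram = pipeline_diagram(program)
for text, row in zip(program, diagram):
    print(f"{text:<16} {' '.join(row)}")
```

Reading a row gives the resource use of one instruction over time; reading a column gives the resource use in one time slot. In the fifth cycle every stage is busy with a different instruction.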

Clicker Question
Time/Program = Instructions/Program × Cycles/Instruction × Time/Cycle
Pipelining the single-cycle processor can increase processor performance by:

     Instructions/program   Cycles/instruction   Time/cycle
A    decrease               decrease             same
B    same                   increase             decrease
C    same                   same                 decrease
D    increase               decrease             increase
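
The performance equation can be evaluated directly with the cycle times from the recap table (800 ps single-cycle, 200 ps pipelined). The sketch below assumes an ideal pipeline with CPI = 1 and no stalls, and ignores the few cycles needed to fill the pipeline; the instruction count `N` is a hypothetical value chosen only for illustration.

```python
def time_per_program(instructions, cycles_per_instruction, picoseconds_per_cycle):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle (result in ps)."""
    return instructions * cycles_per_instruction * picoseconds_per_cycle

N = 1_000_000  # hypothetical instruction count; the same program runs on both designs

single_cycle_ps = time_per_program(N, 1, 800)  # one long 800 ps cycle per instruction
pipelined_ps = time_per_program(N, 1, 200)     # assumed ideal CPI of 1 with 200 ps cycles

speedup = single_cycle_ps / pipelined_ps
print(f"speedup = {speedup:.1f}x")
```

Under these assumptions only the time-per-cycle term changes, which reproduces the 4x relative speed in the recap table.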

Pipelined Control
• Control signals derived from instruction
  − As in single-cycle implementation
  − Information is stored in pipeline registers for use by later stages

Hazards Ahead

Structural Hazard
• Problem: Two or more instructions in the pipeline compete for access to a single physical resource
• Solution 1: Instructions take turns to use resource, some instructions have to stall
• Solution 2: Add more hardware to machine
• Can always solve a structural hazard by adding more hardware

Regfile Structural Hazards
• Each instruction:
  − can read up to two operands in decode stage
  − can write one value in writeback stage
• Avoid structural hazard by having separate “ports”
  − two independent read ports and one independent write port
• Three accesses per cycle can happen simultaneously

Structural Hazard: Memory Access
[Figure: pipeline diagram for the instruction sequence add t0, t1, t2; or t3, t4, t5; slt t6, t0, t3; sw t0, 4(t3); lw t0, 8(t3).]
• Instruction and data memory used simultaneously
  ✓ Use two separate memories

Instruction and Data Caches
[Figure: a processor (control; datapath with PC, registers, and arithmetic & logic unit (ALU)) connected through an instruction cache and a data cache to memory (DRAM), which holds the bytes of program and data.]
Caches: small and fast “buffer” memories

Structural Hazards – Summary
• Conflict for use of a resource
• In RISC-V pipeline with a single memory
  − Load/store requires data access
  − Without separate memories, instruction fetch would have to stall for that cycle
    ▪ All other operations in pipeline would have to wait
• Pipelined datapaths require separate instruction/data memories
  − Or separate instruction/data caches
• RISC ISAs (including RISC-V) designed to avoid structural hazards
  − e.g. at most one memory access/instruction

Data Hazard: Register Access
[Figure: pipeline diagram for the instruction sequence add t0, t1, t2; or t3, t4, t5; slt t6, t4, t3; sw t0, 4(t3); lw t0, 8(t3).]
• Separate ports, but what if the same register is written and read in the same cycle?
• Does sw in the example fetch the old or new value?

Register Access Policy
[Figure: pipeline diagram for the instruction sequence add t0, t1, t2; or t3, t4, t5; slt t6, t4, t3; sw t0, 4(t3); lw t0, 8(t3).]
• Exploit high speed of register file (100 ps)
  1) WB updates value
  2) ID reads new value
• Indicated in diagram by shading
Might not always be possible to write then read in same cycle, especially in high-frequency designs.
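
The write-then-read policy can be modeled as a register file whose write port is serviced before its read ports within one cycle. This is a behavioral sketch only; the class `RegisterFile` and its `cycle` method are hypothetical names, not part of the lecture's datapath.

```python
class RegisterFile:
    """Behavioral model: 32 registers, x0 hard-wired to zero (assumed RV32I layout)."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, write_addr, write_data, write_enable, read_addr_a, read_addr_b):
        # 1) First part of the cycle: the write-back stage updates the value.
        if write_enable and write_addr != 0:
            self.regs[write_addr] = write_data
        # 2) Second part of the cycle: the decode stage reads and sees the new value.
        return self.regs[read_addr_a], self.regs[read_addr_b]

T0, T3 = 5, 28  # RISC-V ABI names: t0 is x5, t3 is x28

rf = RegisterFile()
# add t0, t1, t2 writes t0 (value 9 is made up) while sw t0, 4(t3) reads t3 and t0.
data_a, data_b = rf.cycle(write_addr=T0, write_data=9, write_enable=True,
                          read_addr_a=T3, read_addr_b=T0)
print(data_a, data_b)
```

With this ordering, sw picks up the new value of t0 even though add writes it in the same cycle.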

Data Hazard: ALU Result
[Figure: pipeline diagram for the instruction sequence add s0, t0, t1; sub t2, s0, t0; or t6, s0, t3; xor t5, t1, s0; sw s0, 8(t3). Value of s0 over the cycles: 5, 5, 5, 5, 5/9, 9, 9, 9, 9.]
s0 holds “5”, then add instr changes s0 to “9”
Without some fix, sub and or will calculate wrong result!

Solution 1: Stalling
• Problem: Instruction depends on result from previous instruction
  − add s0, t0, t1
    sub t2, s0, t3
• Bubble: stall dependent instruction
  − effectively NOP: affected pipeline stages do “nothing”

Stalls and Performance
• Stalls reduce performance
  − But stalls are required to get correct results
• Compiler could try to arrange code to avoid hazards and stalls
  − Requires knowledge of the pipeline structure

Solution 2: Forwarding
[Figure: pipeline diagram for the instruction sequence add t0, t1, t2; or t3, t0, t5; sub t6, t0, t3; xor t5, t1, t0; sw t0, 8(t3). Value of t0 over the cycles: 5, 5, 5, 5, 5/9, 9, 9, 9, 9.]
Forwarding: grab operand from pipeline stage, rather than register file

Forwarding (aka Bypassing)
• Use result when it is computed
  − Don’t wait for it to be stored in a register
  − Requires extra connections in the datapath

1) Detect Need for Forwarding (example)
[Figure: pipeline diagram for add t0, t1, t2; or t3, t0, t5; sub t6, t0, t3, comparing instX.rd with instD.rs1.]
Compare destination of older instructions in pipeline with sources of new instruction in decode stage. Must ignore writes to x0!
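
The comparison described above can be written as a small predicate. This is a sketch of the detection step only, not of the lecture's actual control logic; the `Inst` record, its `writes_reg` field, and the function `forward_from` are hypothetical names introduced for illustration.

```python
from collections import namedtuple

# Hypothetical decoded-instruction record; writes_reg is False for stores and branches.
Inst = namedtuple("Inst", "rd rs1 rs2 writes_reg")

def forward_from(source_reg, older_insts):
    """Return the stage name to forward from, or None to use the register file.

    older_insts lists (stage_name, Inst) pairs, youngest first, so the most
    recent producer of source_reg wins.
    """
    if source_reg == 0:  # x0 always reads as zero: must ignore writes to x0
        return None
    for stage, inst in older_insts:
        if inst.writes_reg and inst.rd == source_reg:
            return stage
    return None

# ABI register numbers: t0 = x5, t1 = x6, t2 = x7, t3 = x28, t5 = x30, t6 = x31
add_inst = Inst(rd=5, rs1=6, rs2=7, writes_reg=True)     # add t0, t1, t2
or_inst = Inst(rd=28, rs1=5, rs2=30, writes_reg=True)    # or  t3, t0, t5
sub_inst = Inst(rd=31, rs1=5, rs2=28, writes_reg=True)   # sub t6, t0, t3

# or is in decode while add is in execute: instD.rs1 matches instX.rd.
print(forward_from(or_inst.rs1, [("X", add_inst)]))
# sub is in decode while or is in execute and add is in memory.
print(forward_from(sub_inst.rs1, [("X", or_inst), ("M", add_inst)]))
print(forward_from(sub_inst.rs2, [("X", or_inst), ("M", add_inst)]))
```

The same check runs once for rs1 and once for rs2, against each older instruction still in flight.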

Example Forwarding Path
[Figure: the pipelined RV32I datapath with an added forwarding path, selected by a block labeled Forwarding Control Logic.]
Same idea extends to rs2, and to the instD, instM instruction pairing

Administrivia
• Project 3.1 still due next Wednesday (3/7)
• Homework 2 due Friday (11:59 PM)
• Project party on both Monday (8-10, Cory 293) and Wednesday (7-10, Cory 293)
• Guerrilla session tonight 7-9 pm, Barrows 20!
• Midterm 2, March 20, is moved to 8-10 PM (was 7-9 on the website)
• Alternative exam earlier, 6-8 PM (so people don’t need to be in exams until midnight :)
• Submit the exam conflict form if you haven’t already

Load Data Hazard
[Figure: pipeline diagram of a load followed by instructions that use its result. Annotations: 1 cycle stall unavoidable; forward; unaffected.]

Stall Pipeline
[Figure: pipeline diagram with a stall inserted after the load. Annotations: Stall; repeat and instruction and forward.]

lw Data Hazard
• Slot after a load is called a load delay slot
  − If that instruction uses the result of the load, then the hardware will stall for one cycle
  − Equivalent to inserting an explicit nop in the slot
    ▪ except the latter uses more code space
  − Performance loss!
• Idea:
  − Put unrelated instruction into load delay slot
  − No performance loss!

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction!
• RISC-V code for D=A+B; E=A+C;

Original order (13 cycles):
  lw  t1, 0(t0)
  lw  t2, 4(t0)
  add t3, t1, t2      Stall!
  sw  t3, 12(t0)
  lw  t4, 8(t0)
  add t5, t1, t4      Stall!
  sw  t5, 16(t0)

Alternative (11 cycles):
  lw  t1, 0(t0)
  lw  t2, 4(t0)
  lw  t4, 8(t0)
  add t3, t1, t2
  sw  t3, 12(t0)
  add t5, t1, t4
  sw  t5, 16(t0)
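
The 13-cycle and 11-cycle totals can be reproduced with a simple count: one cycle per instruction, four more to drain a five-stage pipeline, plus one stall for every instruction that uses a load result in the load delay slot. The sketch assumes full forwarding otherwise; `load_use_stalls`, `total_cycles`, and the tuple encoding are hypothetical.

```python
def load_use_stalls(program):
    """Count instructions that use a load's result in the slot right after the load.

    program is a list of (opcode, dest, sources). Simplification: a store of the
    just-loaded value is also counted as a stall; that case does not occur here.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls

def total_cycles(program, stages=5):
    # One issue slot per instruction, (stages - 1) cycles to drain, plus stalls.
    return len(program) + (stages - 1) + load_use_stalls(program)

original = [
    ("lw", "t1", ("t0",)), ("lw", "t2", ("t0",)),
    ("add", "t3", ("t1", "t2")), ("sw", None, ("t3", "t0")),
    ("lw", "t4", ("t0",)),
    ("add", "t5", ("t1", "t4")), ("sw", None, ("t5", "t0")),
]
alternative = [
    ("lw", "t1", ("t0",)), ("lw", "t2", ("t0",)), ("lw", "t4", ("t0",)),
    ("add", "t3", ("t1", "t2")), ("sw", None, ("t3", "t0")),
    ("add", "t5", ("t1", "t4")), ("sw", None, ("t5", "t0")),
]
print(total_cycles(original), total_cycles(alternative))
```

Moving the third load up fills both load delay slots with unrelated work, which removes both stalls.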
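
Control Hazards
[Figure: pipeline diagram for the instruction sequence beq t0, t1, label; sub t2, s0, t5; or t6, s0, t3; xor t5, t1, s0; sw s0, 8(t3). The two instructions after the branch are each annotated “executed regardless of branch outcome!”; a later cycle is annotated “PC updated reflecting branch outcome”.]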

Observation
• If branch not taken, then instructions fetched sequentially after branch are correct
• If branch or jump taken, then need to flush incorrect instructions from pipeline by converting to NOPs
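
Kill Instructions after Branch if Taken
[Figure: pipeline diagram for beq t0, t1, label; sub t2, s0, t5; or t6, s0, t3; label: xxxxxx. The branch is marked “Taken branch”, sub and or are each marked “Convert to NOP”, and the fetch of label follows “PC updated reflecting branch outcome”.]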

Reducing Branch Penalties
• Every taken branch in simple pipeline costs 2 dead cycles
• To improve performance, use “branch prediction” to guess which way branch will go earlier in pipeline
• Only flush pipeline if branch prediction was incorrect
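
The effect of the 2-cycle penalty on CPI can be worked through with made-up numbers. In the sketch below the instruction mix (1000 instructions, 200 branches, 120 taken) and the number of mispredictions (20) are hypothetical; only the penalty of 2 dead cycles comes from the slide. Data-hazard stalls and pipeline fill time are ignored.

```python
BRANCH_PENALTY = 2  # dead cycles per flushed branch, as stated on the slide

def cycles(instructions, flushed_branches, penalty=BRANCH_PENALTY):
    """Cycles for a run, counting one cycle per instruction plus flush penalties."""
    return instructions + flushed_branches * penalty

instructions = 1000   # hypothetical program length
taken_branches = 120  # hypothetical: 200 branches, 120 of them taken
mispredicted = 20     # hypothetical: predictor wrong on 20 of the 200 branches

# Simple pipeline: every taken branch flushes two instructions.
cycles_simple = cycles(instructions, taken_branches)
# With branch prediction: only mispredicted branches flush.
cycles_predicted = cycles(instructions, mispredicted)

cpi_simple = cycles_simple / instructions
cpi_predicted = cycles_predicted / instructions
print(cpi_simple, cpi_predicted)
```

For these assumed counts, prediction lowers the flush overhead from 240 cycles to 40.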

Branch Prediction
[Figure: pipeline diagram for beq t0, t1, label followed by the instructions at label. Annotations: Taken branch; Guess next PC!; Check guess correct.]

Increasing Processor Performance
1. Clock rate
  − Limited by technology and power dissipation
2. Pipelining
  − “Overlap” instruction execution
  − Deeper pipeline: 5 => 10 => 15 stages
    ▪ Less work per stage → shorter clock cycle
    ▪ But more potential for hazards (CPI > 1)
3. Multi-issue “super-scalar” processor
  − Multiple execution units (ALUs)
    ▪ Several instructions executed simultaneously
    ▪ CPI < 1 (ideally)

Superscalar Processor
[Figure: superscalar processor organization, from P&H p. 340.]

Benchmark: CPI of Intel Core i7
[Figure: CPI of the Intel Core i7 on benchmark programs, with CPI = 1 marked, from P&H p. 350.]

In Conclusion
• Pipelining increases throughput by overlapping execution of multiple instructions
• All pipeline stages have same duration
  − Choose partition that accommodates this constraint
• Hazards potentially limit performance
  − Maximizing performance requires programmer/compiler assistance
  − E.g. Load and Branch delay slots
• Superscalar processors use multiple execution units for additional instruction level parallelism
  − Performance benefit highly code dependent

Extra Slides

Pipelining and ISA Design
• RISC-V ISA designed for pipelining
  − All instructions are 32-bits
    ▪ Easy to fetch and decode in one cycle
    ▪ Versus x86: 1- to 15-byte instructions
  − Few and regular instruction formats
    ▪ Decode and read registers in one step
  − Load/store addressing
    ▪ Calculate address in 3rd stage, access memory in 4th stage
  − Alignment of memory operands
    ▪ Memory access takes only one cycle

Superscalar Processor
• Multiple issue “superscalar”
  − Replicate pipeline stages ⇒ multiple pipelines
  − Start multiple instructions per clock cycle
  − CPI < 1, so use Instructions Per Cycle (IPC)
  − E.g., 4 GHz 4-way multiple-issue
    ▪ 16 BIPS, peak CPI = 0.25, peak IPC = 4
  − Dependencies reduce this in practice
• “Out-of-Order” execution
  − Reorder instructions dynamically in hardware to reduce impact of hazards
• CS152 discusses these techniques!
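
The peak figures in the 4 GHz, 4-way example follow from simple arithmetic, sketched below. The helper name `peak_rates` is hypothetical, and the result is an upper bound that assumes every issue slot is filled in every cycle.

```python
def peak_rates(clock_hz, issue_width):
    """Peak IPC, CPI, and instructions per second for an ideal multiple-issue core."""
    peak_ipc = issue_width                      # instructions started per cycle
    peak_cpi = 1 / issue_width                  # cycles per instruction
    peak_ips = clock_hz * issue_width           # instructions per second
    return peak_ipc, peak_cpi, peak_ips

ipc, cpi, ips = peak_rates(clock_hz=4_000_000_000, issue_width=4)
print(f"peak IPC = {ipc}, peak CPI = {cpi}, peak rate = {ips // 10**9} BIPS")
```

Dependencies between instructions keep real programs below this peak, which is why measured IPC is the more useful figure.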