SuperScalar Design Prime
SUPERSCALAR DESIGN PRIME
Zhao Zhang
CprE 381, Computer Organization and Assembly-Level Programming, Fall 2012
Original slides from CprE 581, Advanced Computer Architecture
History of Superscalar Design
First appearance in the 1960s
• Scoreboarding
• Tomasulo Algorithm
Popular use since the 1990s
• SGI MIPS processors
• Sun UltraSPARC
• DEC Alpha 21x64 series
• Intel/AMD processors
Now appearing in embedded processors
• Cortex-A9: Two-way, limited out-of-order
• Cortex-A15: Three-way, close to Intel/AMD design
Why Superscalar
Get more performance than a scalar pipeline
Superscalar techniques:
• Deep pipeline
• Multi-issue
• Branch prediction
• Register renaming
• Out-of-order execution
• Speculative execution
• Memory disambiguation
Code Example
for (i = 0; i < 1000; i++) X[i] = X[i] + b;

; loop body, initialization not shown
; R4: &X[i], R5: &X[1000] (that is, X + 1000*4), R6: b
loop: LW   R8, R4($0)    ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, R4($0)    ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; R9 = (R4 < R5)
      BNE  R9, R0, loop  ; end of loop? branch back while R9 != 0
Frontend and Backend
Frontend: In-order fetch, decode, and rename
Backend: Out-of-order issue, execute/writeback, in-order commit
Frontend may send “junk” instructions to the backend
• Junk instructions occur with branch mis-prediction or exceptions
• Design goal: Minimize the percentage of “junk” instructions
Backend must be able to detect and handle “junk” instructions
• Flush junk instructions upon detection
• In-order commit (retire) so that junk instructions won’t affect the “architectural state”
• Dozens of cycles likely for handling a branch mis-prediction
Frontend and Backend
[Figure: processor pipeline with the frontend and backend portions labeled. Source: “Cortex-A9 Processor Microarchitecture”, slide 6]
The Multi-Issue Factor
Multi-issue affects all pipeline stages. In the same cycle, in the frontend:
• N inst. are fetched: Usually from one I-cache block
• N inst. are decoded: Multiple decoders
• N inst. are renamed: Multi-ported renaming table, detecting intra-group dependence
In the backend:
• Up to N inst. are scheduled: Multi-ported queue with broadcast
• N inst. read the register file: Multi-ported register file
• M inst. are executed at functional units: Multiple functional units
• N inst. write back register values: Multi-ported register file
• N inst. are committed: Multi-banked reorder buffer, also involves the rename table
Note: “N” is not necessarily the same value across pipeline stages
Frontend: Branch Prediction
Branch prediction is critical to reducing “junk” instructions
Poor branch prediction is a “disaster” for performance:
SPECint programs have on average ~15% branches
• Every 100 instructions contain 15 branches
• Assume 10% mis-prediction => 1.5 branch mis-predictions
• Assume 20-cycle mis-prediction penalty => 30 lost cycles
• Assume IPC=3.0 => 33.3 cycles to execute 100 instructions
• 30 lost cycles on top of 33.3 cycles: a 90% loss for the 10% mis-prediction
• Mis-prediction penalty is workload-dependent, and can be significantly longer than 20 cycles
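
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the slide's back-of-the-envelope model; the function name `misprediction_cost` and its parameter names are made up for illustration, and the default values are the slide's assumptions.

```python
def misprediction_cost(insts=100, branch_frac=0.15, mispredict_rate=0.10,
                       penalty_cycles=20, base_ipc=3.0):
    """Return (base_cycles, lost_cycles, slowdown, effective_ipc)."""
    branches = insts * branch_frac                  # 15 branches per 100 instructions
    mispredictions = branches * mispredict_rate     # 1.5 mis-predictions
    lost_cycles = mispredictions * penalty_cycles   # 30 cycles spent on junk work
    base_cycles = insts / base_ipc                  # 33.3 cycles with perfect prediction
    slowdown = lost_cycles / base_cycles            # extra time relative to the ideal
    effective_ipc = insts / (base_cycles + lost_cycles)
    return base_cycles, lost_cycles, slowdown, effective_ipc


base, lost, slowdown, ipc = misprediction_cost()
print(f"{base:.1f} base cycles, {lost:.1f} lost cycles, "
      f"{slowdown:.0%} extra time, effective IPC {ipc:.2f}")
```

Under these assumptions the effective IPC drops from 3.0 to about 1.58, which is why a seemingly small 10% mis-prediction rate matters so much.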
Frontend: Branch Prediction
Branch prediction is made every cycle
• Otherwise, instruction flow stops
• It’s done in parallel with instruction fetch
The backend sends back feedback about past predictions
[Figure: fetch loop. Labels: Inst. Cache; Pred-PC; INST; target, branch, and return addr. predictors; single cycle loop; feedback from the backend]
Frontend: Branch Prediction
Three components in a simple design:
Branch Target Buffer (BTB): What’s the branch target?
Branch History Table (BHT): Is the branch taken or not?
Return Address Stack (RAS)
• Function return is a special type of branch instruction
• There are multiple valid branch targets for the return
How BTB and BHT work in general
• Bet that the same patterns will repeat
• Use only the PC and past branch outcome history in the prediction
Frontend: Branch Prediction
Branch Target Buffer with combined Branch History Table
[Figure: BTB lookup. The PC of the instruction in FETCH is compared (=?) with the Branch PC field of the table; each entry also holds a Predicted PC and extra prediction state bits (see later)]
• Yes: instruction is a branch; use the predicted PC as the next PC
• No: branch not predicted; proceed normally (Next PC = PC+4)
From slides of CprE 581 Computer Systems Architecture
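
The lookup described on this slide can be sketched in Python as follows. This is an illustrative model only: the class and method names (`BTBEntry`, `BranchTargetBuffer`, `next_pc`, `update`) are hypothetical, the table is modeled as an unbounded dictionary rather than a fixed-size hardware structure, and the instruction addresses in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class BTBEntry:
    branch_pc: int      # tag: PC of a branch seen before
    predicted_pc: int   # where to fetch next if the branch is predicted taken
    state: int          # 1-bit prediction state: 1 = predict taken


class BranchTargetBuffer:
    def __init__(self):
        # Assumption: an unbounded, fully associative table keyed by branch PC.
        self.entries = {}

    def next_pc(self, pc):
        """Hit and predicted taken: use the predicted PC; otherwise PC + 4."""
        entry = self.entries.get(pc)
        if entry is not None and entry.state == 1:
            return entry.predicted_pc
        return pc + 4

    def update(self, pc, taken, target):
        """Feedback from the backend about a resolved branch."""
        if taken:
            self.entries[pc] = BTBEntry(pc, target, 1)
        elif pc in self.entries:
            self.entries[pc].state = 0


# Hypothetical addresses: LW at 0x100, BNE five instructions later at 0x114.
btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x114)))   # no entry yet, so fall through to PC + 4
btb.update(0x114, taken=True, target=0x100)
print(hex(btb.next_pc(0x114)))   # entry present and state bit set
```

Real designs index a fixed number of entries with low PC bits and compare a tag, which is what the "=?" box in the figure suggests; the dictionary hides that detail.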
Frontend: Branch Prediction
First time fetching at BNE: Predicted as Not Taken
loop: LW   R8, R4($0)    ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, R4($0)    ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; end of array?
      BNE  R9, R0, loop  => mis-prediction on 1st fetch

BTB/BHT contents (all entries empty):
Branch PC  Predicted PC  State bit
--         --            0
--         --            0
--         --            0
--         --            0
--         --            0
--         --            0

Prediction for each fetched instruction:
LW    => NT, right
ADD   => NT, right
SW    => NT, right
ADDI  => NT, right
SLT   => NT, right
BNE   => NT, WRONG
Frontend: Branch Prediction
What happens after the mis-prediction
1. The frontend starts fetching junk instructions, probably dozens of them
2. The backend detects the mis-prediction, flushes the backend pipeline, and notifies the frontend about the mis-predicted branch
3. The frontend updates the BTB/BHT, filling in BNE-PC and LW-PC, and changes the prediction state bit
4. The frontend restarts fetching from LW-PC

BTB/BHT contents after the update:
Branch PC  Predicted PC  State bit
--         --            0
--         --            0
--         --            0
--         --            0
--         --            0
BNE-PC     LW-PC         1
Frontend: Branch Prediction
2nd time fetching at BNE: Predicted as Taken, jump to LW-PC
loop: LW   R8, R4($0)    ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, R4($0)    ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; end of array?
   => BNE  R9, R0, loop

BTB/BHT contents:
Branch PC  Predicted PC  State bit
--         --            0
--         --            0
--         --            0
--         --            0
--         --            0
BNE-PC     LW-PC         1

Prediction for each fetched instruction:
LW    => NT, right
ADD   => NT, right
SW    => NT, right
ADDI  => NT, right
SLT   => NT, right
BNE   => Taken, RIGHT
Frontend: Branch Prediction
Last time fetching at BNE-PC, predicted as Taken
• It’s wrong because the loop will exit
This time, the prediction state bit is changed to 0
• Next time the prediction outcome on BNE-PC is Not Taken

BTB/BHT contents after the update:
Branch PC  Predicted PC  State bit
--         --            0
--         --            0
--         --            0
--         --            0
--         --            0
BNE-PC     LW-PC         0
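
The walk-through on the last few slides (first fetch of BNE mis-predicted, later fetches predicted taken, final fetch mis-predicted) can be replayed with a small simulation. This is a minimal sketch of a 1-bit predictor for a single branch; the function name `count_mispredictions` is invented for illustration, and the model ignores the BTB tag match and pipeline timing.

```python
def count_mispredictions(outcomes, state=0):
    """1-bit predictor: predict whatever the branch did last time (1 = taken)."""
    wrong = 0
    for taken in outcomes:
        predicted_taken = (state == 1)
        if predicted_taken != taken:
            wrong += 1
        state = 1 if taken else 0   # feedback from the backend updates the bit
    return wrong


# BNE outcomes for the 1000-iteration loop: taken 999 times, then it falls through.
outcomes = [True] * 999 + [False]
print(count_mispredictions(outcomes))   # prints 2
```

Only the first and the last executions of BNE are mis-predicted, matching the slides.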
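Branch Prediction State Bit
General form:
1. Access: the PC selects the prediction state
2. Predict: output T/NT
3. Feedback: the actual outcome (T/NT) updates the state

1-bit prediction:
State                   On T        On NT
1 (Predict Taken)       stay in 1   go to 0
0 (Predict Not Taken)   go to 1     stay in 0

From CprE 581, Computer Systems Architecture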
Branch History Table
Branch direction prediction is usually more challenging
• BHT can be separated from BTB (often the case)
• 2-bit or 3-bit states are usually used
• BHT can be organized in two levels to predict on correlation between branches
• BHT can have sophisticated organizations to further improve accuracy
Return Address Stack: Works on return instructions, simple and effective (not to be discussed more)
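
To see why more than one state bit helps, the sketch below compares 1-bit and 2-bit state on a branch that closes a short inner loop executed many times. It assumes the common saturating-counter scheme, which the slide does not spell out; the function name `mispredictions` and the outcome pattern are made up for illustration.

```python
def mispredictions(outcomes, bits):
    """Saturating-counter predictor with `bits` bits of state for one branch."""
    top = (1 << bits) - 1          # highest counter value
    threshold = (top + 1) // 2     # predict taken at or above this value
    counter, wrong = 0, 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            wrong += 1
        # Count up on taken, down on not taken, saturating at both ends.
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return wrong


# Inner-loop branch: taken three times, then not taken; the loop runs 100 times.
outcomes = [True, True, True, False] * 100
print(mispredictions(outcomes, bits=1))   # prints 200
print(mispredictions(outcomes, bits=2))   # prints 102
```

With one bit, every loop exit flips the state, so both the exit and the next entry are mis-predicted. With two bits, a single not-taken outcome is not enough to flip the prediction, so only the exit is missed after warm-up.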
Frontend: Register Renaming
Consider two loop iterations: they conflict on register usage and so cannot be executed in parallel as written, even though their work is mostly parallel

      LW   R8, R4($0)    ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, R4($0)    ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; end of array?
      BNE  R9, R0, loop

      LW   R8, R4($0)    ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, R4($0)    ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; end of array?
      BNE  R9, R0, loop
Frontend: Register Renaming
Rename architectural registers to physical registers: remove false dependences and keep true dependences

      LW   P32, P4($0)     ; load X[i]
      ADD  P33, P32, P6    ; X[i] = X[i] + b
      SW   P33, P4($0)     ; store X[i]
      ADDI P34, P4, 4      ; next element
      SLT  P35, P34, P5    ; end of array?
      BNE  P35, P0, loop

      LW   P36, P34($0)    ; load X[i]
      ADD  P37, P36, P6    ; X[i] = X[i] + b
      SW   P37, P34($0)    ; store X[i]
      ADDI P38, P34, 4     ; next element
      SLT  P39, P38, P5    ; end of array?
      BNE  P39, P0, loop
Frontend: Register Renaming
How the design works:
• There is a register mapping table that maps architectural registers to physical registers
• There is a queue of free physical registers
• Every instruction with an output register is assigned an unused, free physical register
• Another mapping table is used to recover from a mis-predicted path
• There are a number of design variants in real processors
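
The mapping table and free queue described above can be sketched as follows. This is an illustration under stated assumptions, not a description of any real processor: the function name `rename`, the tuple encoding of instructions, and the sizes (32 architectural and 64 physical registers, with Ri initially mapped to Pi) are all invented so that the output lines up with the renaming example in these slides. Returning physical registers to the free queue at commit is not modeled.

```python
from collections import deque


def rename(program, num_arch=32, num_phys=64):
    """Rename (op, dst, srcs) tuples; returns the renamed list and final map."""
    # Assumption: architectural register Ri starts out mapped to Pi.
    map_table = {f"R{i}": f"P{i}" for i in range(num_arch)}
    free_list = deque(f"P{i}" for i in range(num_arch, num_phys))
    renamed = []
    for op, dst, srcs in program:
        # Sources are read through the current map first (keeps true dependences).
        new_srcs = [map_table.get(s, s) for s in srcs]   # immediates pass through
        new_dst = None
        if dst is not None:
            new_dst = free_list.popleft()    # take an unused physical register
            map_table[dst] = new_dst         # later readers of dst see the new name
        renamed.append((op, new_dst, new_srcs))
    return renamed, map_table


body = [
    ("LW",   "R8", ["R4"]),
    ("ADD",  "R8", ["R8", "R6"]),
    ("SW",   None, ["R8", "R4"]),
    ("ADDI", "R4", ["R4", "4"]),
    ("SLT",  "R9", ["R4", "R5"]),
    ("BNE",  None, ["R9", "R0"]),
]
renamed, final_map = rename(body * 2)   # two loop iterations
for op, dst, srcs in renamed:
    print(op, dst, srcs)
```

Because each destination gets a fresh physical register, the second iteration's writes to R8, R4, and R9 no longer collide with the first iteration's, while each source still names the register that holds the value it truly depends on.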
Frontend: Register Renaming
The roles of register renaming:
• Remove register name dependence, keep true data dependence, so that more instructions can be safely reordered
• Help the backend implement speculative execution, as junk instructions cannot affect the inputs of good instructions
  • A younger instruction writes to a newly assigned physical register, so it cannot affect the inputs of older instructions
  • A good instruction is always older than any junk instruction
Backend: Out-Of-Order Scheduling
Common Design: Issue Queue

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -
Backend: Out-Of-Order Scheduling
Schedule: Select ready instructions and broadcast their tags (dst) to all other instructions for matching. Here LW and ADDI have both sources ready, so they are selected.

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -
Backend: Out-Of-Order Scheduling
After LW and ADDI are issued, assume no new instructions

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
ADD   yes    P33  P32   yes     P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
--    no     --   --    --      --    --      --   -
SLT   yes    P35  P34   yes     P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -
Backend: Out-Of-Order Scheduling
After ADD and SLT are issued, assume no new instructions

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
--    no     --   --    --      --    --      --   -
SW    yes    --   P33   yes     P4    yes     3    2
--    no     --   --    --      --    --      --   -
--    no     --   --    --      --    --      --   -
BNE   yes    --   P35   yes     P0    yes     6    -
Backend: Out-Of-Order Scheduling
How the design works
• Instructions are sent to the issue queue after renaming
• A select logic chooses up to N instructions, all dependence-free, to be executed
• The tags of the selected instructions are broadcast to all other queue entries
• A wakeup logic clears the dependence of other instructions on the selected instructions
Two major design variants: Issue Queue vs. Reservation Station
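
The select and wakeup steps listed above can be sketched as a small cycle-by-cycle loop. This is a simplified model under several assumptions that the slides do not state: every instruction completes in one cycle, selection is oldest-first, the issue width is 2, and the function name `schedule` and the dictionary layout of a queue entry are invented for illustration.

```python
def schedule(queue, ready_regs, width=2):
    """Issue up to `width` ready instructions per cycle; return ops per cycle."""
    ready = set(ready_regs)        # physical registers whose values are available
    waiting = list(queue)          # queue entries in program order
    cycles = []
    while waiting:
        # Select: oldest entries whose source registers are all ready.
        selected = [e for e in waiting
                    if all(s in ready for s in e["srcs"])][:width]
        if not selected:
            raise RuntimeError("no entry is ready; a source is never produced")
        for e in selected:
            waiting.remove(e)
        # Wakeup: broadcast destination tags so dependents become ready next cycle.
        ready.update(e["dst"] for e in selected if e["dst"])
        cycles.append([e["op"] for e in selected])
    return cycles


queue = [
    {"op": "LW",   "dst": "P32", "srcs": ["P4"]},
    {"op": "ADD",  "dst": "P33", "srcs": ["P32", "P6"]},
    {"op": "SW",   "dst": None,  "srcs": ["P33", "P4"]},
    {"op": "ADDI", "dst": "P34", "srcs": ["P4"]},
    {"op": "SLT",  "dst": "P35", "srcs": ["P34", "P5"]},
    {"op": "BNE",  "dst": None,  "srcs": ["P35", "P0"]},
]
for n, ops in enumerate(schedule(queue, ["P0", "P4", "P5", "P6"]), start=1):
    print(n, ops)
```

The issue order reproduces the three scheduling slides: LW and ADDI first, then ADD and SLT, then SW and BNE. A real load takes longer than one cycle, so ADD would normally wake up later than this model suggests.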
Backend: Register Read, Data Forwarding and Writeback
Note: In a reservation-station design, register read happens before instruction scheduling
[Figure: backend datapath. Stages: Issue (scheduling), Reg-Read, Execute, Writeback. Blocks: Issue Queue, Register File, Forwarding Network, and functional units (Load/Store, Int, Mult, Div, Other)]
Reorder Buffer and In-Order Commit
Instructions enter and leave the ROB in program order
“Architectural Register State” changes in program order
Junk instructions may produce values, but their values never appear in the “Architectural Register State”
• Junk instructions will be flushed upon detection
[Figure: the ROB drawn as a queue at three points in time, each with head and tail pointers and regions labeled “freed” and “allocated”]
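
The in-order commit rule can be sketched with a double-ended queue. This is a minimal model, not the slide author's design: the class `ReorderBuffer`, its method names, and the two-field entry are hypothetical, and `flush_after` assumes the mis-predicted branch is still in the buffer when it is called.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # left end = head (oldest), right end = tail

    def allocate(self, name):
        """Enter the ROB at the tail, in program order."""
        entry = {"name": name, "ready": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire from the head while the oldest entry has finished executing."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            committed.append(self.entries.popleft()["name"])
        return committed

    def flush_after(self, branch_entry):
        """Drop every entry younger than the mis-predicted branch."""
        flushed = []
        while self.entries and self.entries[-1] is not branch_entry:
            flushed.append(self.entries.pop()["name"])
        return flushed[::-1]


rob = ReorderBuffer()
lw, addi, bne, junk1, junk2 = (rob.allocate(n)
                               for n in ["LW", "ADDI", "BNE", "junk1", "junk2"])
addi["ready"] = True             # finishes out of order, ahead of LW
junk1["ready"] = True            # a junk instruction may produce a value
print(rob.commit())              # [] because LW at the head is not ready
print(rob.flush_after(bne))      # ['junk1', 'junk2']
lw["ready"] = bne["ready"] = True
print(rob.commit())              # ['LW', 'ADDI', 'BNE']
```

ADDI and the junk instruction complete early, but nothing becomes architecturally visible until the head is ready, and the junk entries are removed before they ever reach the head.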
Reorder Buffer
Fields of a ROB entry:
• Dest arch reg
• Dest phy reg
• Exceptions?
• Program Counter
• Branch or L/W?
• Ready?
Recall the Renaming Example
Consider two loop iterations: rename architectural registers to physical registers, remove false dependences and keep true dependences

      LW   P32, P4($0)     ; load X[i]
      ADD  P33, P32, P6    ; X[i] = X[i] + b
      SW   P33, P4($0)     ; store X[i]
      ADDI P34, P4, 4      ; next element
      SLT  P35, P34, P5    ; end of array?
      BNE  P35, P0, loop

      LW   P36, P34($0)    ; load X[i]
      ADD  P37, P36, P6    ; X[i] = X[i] + b
      SW   P37, P34($0)    ; store X[i]
      ADDI P38, P34, 4     ; next element
      SLT  P39, P38, P5    ; end of array?
      BNE  P39, P0, loop
Architectural Register State
Instruction sequence (two loop iterations):
      LW   R8, R4($0)
      ADD  R8, R8, R6
      SW   R8, R4($0)
      ADDI R4, R4, 4
      SLT  R9, R4, R5
      BNE  R9, R0, loop
      LW   R8, R4($0)
      ADD  R8, R8, R6
      SW   R8, R4($0)
      ADDI R4, R4, 4
      SLT  R9, R4, R5
      BNE  R9, R0, loop
(Figure label: “Mis-predicted path”)

Register mappings at three successive points:

                                 R0  R4   R5  R6  R8   R9
architectural register mapping   P0  P4   P5  P6  P8   P9
speculative register mapping     P0  P4   P5  P6  P8   P9

                                 R0  R4   R5  R6  R8   R9
architectural register mapping   P0  P4   P5  P6  P8   P9
speculative register mapping     P0  P34  P5  P6  P33  P35

                                 R0  R4   R5  R6  R8   R9
architectural register mapping   P0  P34  P5  P6  P33  P35
speculative register mapping     P0  P38  P5  P6  P37  P39
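
The three pairs of mappings above can be reproduced with two dictionaries. This sketch rests on assumptions that the slides leave implicit: the speculative map is updated at rename, the architectural map is updated at commit, the second iteration is taken to be the mis-predicted path, and recovery copies the architectural map over the speculative one once everything older than the junk has committed. The helper name `apply_writes` is made up for illustration.

```python
def apply_writes(mapping, writes):
    """Return a copy of `mapping` after (arch_reg, phys_reg) writes, in order."""
    updated = dict(mapping)
    for arch_reg, phys_reg in writes:
        updated[arch_reg] = phys_reg
    return updated


initial = {"R0": "P0", "R4": "P4", "R5": "P5", "R6": "P6", "R8": "P8", "R9": "P9"}
# Destination renames from the renaming example, one list per loop iteration.
iter1 = [("R8", "P32"), ("R8", "P33"), ("R4", "P34"), ("R9", "P35")]
iter2 = [("R8", "P36"), ("R8", "P37"), ("R4", "P38"), ("R9", "P39")]

arch, spec = dict(initial), dict(initial)   # first point: nothing renamed yet
spec = apply_writes(spec, iter1)            # second point: iteration 1 renamed
arch = apply_writes(arch, iter1)            # iteration 1 commits in order
spec = apply_writes(spec, iter2)            # third point: iteration 2 renamed
print(arch)
print(spec)

# Assumed recovery: iteration 2 turns out to be junk, so discard its renames.
spec = dict(arch)
print(spec)
```

After recovery the speculative map again names P34, P33, and P35 for R4, R8, and R9, so instructions fetched on the correct path read the values produced by the last good instructions.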
Summary
What we have learned
• In-order frontend vs. out-of-order backend
• Branch prediction to keep instruction flow
• Register renaming to remove name dependence and support speculative execution
• Out-of-order scheduling with issue queue
• In-order commit with reorder buffer
What we haven’t learned yet
• Memory disambiguation using load queue and store queue
• Details in complex real processors