SuperScalar Design Prime

SUPERSCALAR DESIGN PRIME
Zhao Zhang
CprE 381, Computer Organization and Assembly-Level Programming, Fall 2012
Original slides from CprE 581, Advanced Computer Architecture

History of Superscalar Design
First appearance in 1960s
• Scoreboarding
• Tomasulo Algorithm

Popular use since 1990s
• SGI MIPS processors
• Sun UltraSPARC
• DEC Alpha 21x64 series
• Intel/AMD processors

Now appearing in embedded processors
• Cortex-A9: Two-way, limited out-of-order
• Cortex-A15: Three-way, close to Intel/AMD design

Why Superscalar
Get more performance than a scalar pipeline

Superscalar Techniques:
• Deep pipeline
• Multi-issue
• Branch prediction
• Register renaming
• Out-of-order execution
• Speculative execution
• Memory disambiguation

Code Example
for (i = 0; i < 1000; i++) X[i] = X[i] + b;

; loop body, initialization not shown
; R4: &X[i], R5: &X[1000] (X + 1000*4), R6: b
Loop: LW   R8, 0(R4)     ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, 0(R4)     ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; R9 = (R4 < R5)
      BNE  R9, R0, Loop  ; end of loop?

Frontend and Backend
Frontend: In-order fetch, decode, and rename
Backend: Out-of-order issue, execute/writeback, in-order commit

Frontend may send “junk” instructions to the backend
• Junk instructions occur with branch mis-prediction or exceptions
• Design goal: Minimize the percentage of “junk” instructions

Backend must be able to detect and handle “junk” instructions
• Flush junk instructions upon detection
• In-order commit (retire) so that junk instructions won’t affect the “architectural state”
• Dozens of cycles likely for handling a branch mis-prediction

Frontend and Backend
[Figure: processor pipeline diagram divided into Frontend and Backend. Source: “Cortex-A9 Processor Microarchitecture”, slide 6]

The Multi-Issue Factor
Multi-issue affects all pipeline stages: In the same cycle,
• N inst. are fetched: Usually from one I-cache block
• N inst. are decoded: Multiple decoders
• N inst. are renamed: Multi-ported renaming table, detecting intra-group dependence

In the backend
• Up to N inst. are scheduled: Multi-ported queue with broadcast
• N inst. read register file: Multi-ported register file
• M inst. are executed at functional units: Multiple functional units
• N inst. write back register values: Multi-ported register file
• N inst. are committed: Multi-banked reorder buffer, also involves rename table

Note: “N” is not necessarily the same value across pipeline stages

Frontend: Branch Prediction
Branch prediction is critical to reducing “junk” instructions

Poor branch prediction is a “disaster” for performance:
• SPECint programs have on average ~15% branches
• Every 100 instructions contain 15 branches
• Assume 10% mis-prediction => 1.5 branch mis-predictions
• Assume 20-cycle mis-prediction penalty => 30 lost cycles
• Assume IPC=3.0 => 33.3 cycles for executing 100 inst
• 90% loss for the 10% mis-prediction (30 lost cycles on top of 33.3 useful cycles)
• Mis-prediction penalty is workload-dependent, and can be significantly longer than 20 cycles
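The arithmetic on this slide can be checked with a short calculation. The sketch below simply restates the slide's assumptions as parameters; the function name misprediction_loss and its parameter names are illustrative, not part of the original slides.

```python
def misprediction_loss(n_inst=100, branch_frac=0.15, mispredict_rate=0.10,
                       penalty_cycles=20, ipc=3.0):
    """Return (useful_cycles, lost_cycles, loss_ratio) for n_inst instructions."""
    mispredictions = n_inst * branch_frac * mispredict_rate  # 1.5 with the defaults
    lost_cycles = mispredictions * penalty_cycles            # cycles spent on junk work
    useful_cycles = n_inst / ipc                             # cycles if nothing is lost
    return useful_cycles, lost_cycles, lost_cycles / useful_cycles


useful, lost, ratio = misprediction_loss()
print(f"{useful:.1f} useful cycles, {lost:.1f} lost cycles, {ratio:.0%} loss")
```

Raising penalty_cycles or mispredict_rate shows how quickly the loss grows, which is the point of the last bullet.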


Frontend: Branch Prediction
Branch prediction is made every cycle
• Otherwise, instruction flow stops
• It’s done in parallel with instruction fetch

The backend sends back feedback about past predictions

[Figure: single-cycle prediction loop. Pred-PC indexes the Inst. Cache (producing INST) and the target, branch, and return addr. predictors, which produce the next Pred-PC; feedback from the backend updates the predictors.]

Frontend: Branch Prediction
Three components in simple design
Branch Target Buffer (BTB): What’s the branch target?
Branch History Table (BHT): Is the branch taken or not?
Return Address Stack (RAS)
• Function return is a special type of branch instruction
• There are multiple valid branch targets for the return

How BTB and BHT work in general
• Bet the same patterns will repeat
• Use only PC and past branch outcome history in the prediction

Frontend: Branch Prediction
Branch Target Buffer with combined Branch History Table

[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field of the BTB entries; each entry holds a Branch PC, a Predicted PC, and extra prediction state bits (see later).]

Yes: instruction is a branch; use predicted PC as next PC
No: branch not predicted; proceed normally (Next PC = PC+4)

From slides of CprE 581 Computer Systems Architecture

Frontend: Branch Prediction

First time fetching at BNE: Predicted as Not Taken

Loop: LW   R8, 0(R4)     ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, 0(R4)     ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; end of array?
      BNE  R9, R0, Loop  ; => mis-prediction on 1st fetch

BTB/BHT contents and the prediction for each fetched instruction:

Inst  Branch PC  Predicted PC  State  Prediction
LW    --         --            0      => NT, right
ADD   --         --            0      => NT, right
SW    --         --            0      => NT, right
ADDI  --         --            0      => NT, right
SLT   --         --            0      => NT, right
BNE   --         --            0      => NT, WRONG


What happens after the mis-prediction
1. The frontend starts fetching junk instructions, probably dozens of them
2. The backend detects the mis-prediction, flushes the backend pipeline, and notifies the frontend about the mis-predicted branch
3. The frontend updates the BTB/BHT, filling in BNE-PC and LW-PC, and changes the prediction state bit
4. The frontend restarts fetching from LW-PC

BTB/BHT contents after the update:

Branch PC  Predicted PC  State
--         --            0
--         --            0
--         --            0
--         --            0
--         --            0
BNE-PC     LW-PC         1



2nd time fetching at BNE: Predicted as Taken, jump to LW-PC

Loop: LW   R8, 0(R4)     ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6    ; X[i] = X[i] + b
      SW   R8, 0(R4)     ; store X[i]
      ADDI R4, R4, 4     ; next element
      SLT  R9, R4, R5    ; end of array?
   => BNE  R9, R0, Loop  ;

BTB/BHT contents and the prediction for each fetched instruction:

Inst  Branch PC  Predicted PC  State  Prediction
LW    --         --            0      => NT, right
ADD   --         --            0      => NT, right
SW    --         --            0      => NT, right
ADDI  --         --            0      => NT, right
SLT   --         --            0      => NT, right
BNE   BNE-PC     LW-PC         1      => Taken, RIGHT


Last time fetching at BNE-PC: Predicted as Taken
• It’s wrong because the loop will exit

This time, the prediction state bit is changed to 0
• Next time the prediction outcome on BNE-PC is Not Taken

BTB/BHT contents after the update:

Branch PC  Predicted PC  State
--         --            0
--         --            0
--         --            0
--         --            0
--         --            0
BNE-PC     LW-PC         0
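The walk-through above can be replayed in a few lines of Python. This is a minimal sketch, not the hardware from the slides: the BTB/BHT is modeled as a dictionary, the addresses 0x00 (LW-PC) and 0x14 (BNE-PC) are hypothetical, and run_loop is an illustrative name.

```python
def run_loop(iterations, btb=None, bne_pc=0x14, lw_pc=0x00):
    """Fetch the loop's BNE `iterations` times; return (mispredictions, btb)."""
    btb = {} if btb is None else btb        # branch PC -> (predicted PC, state bit)
    mispredictions = 0
    for i in range(iterations):
        taken = i < iterations - 1          # BNE falls through only on the last pass
        entry = btb.get(bne_pc)
        # No entry, or state bit 0, means "predict Not Taken".
        predicted_taken = entry is not None and entry[1] == 1
        if predicted_taken != taken:
            mispredictions += 1
        # Feedback from the backend: record the target and the latest outcome.
        btb[bne_pc] = (lw_pc, 1 if taken else 0)
    return mispredictions, btb


misses, btb = run_loop(1000)
print(misses, btb)
```

For the 1000-iteration loop this counts two mis-predictions, on the first and the last fetch of BNE, and leaves the state bit at 0, matching the three slides above. Passing the returned btb into a second call models running the same loop again with a warm table.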



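Branch Prediction State Bit

[Figure, general form: 1. access the state using the PC; 2. predict and output T/NT; 3. update the state with T/NT feedback.]
[Figure, 1-bit prediction: state 1 = Predict Taken, state 0 = Predict Not Taken; a taken (T) outcome leads to state 1, a not-taken (NT) outcome leads to state 0.]

From CprE 581, Computer Systems Architecture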

Branch History Table
Branch direction prediction is usually more challenging
• BHT can be separated from BTB (often the case)
• 2-bit or 3-bit states are usually used
• BHT can be organized in two levels to predict on correlation between branches
• BHT can have sophisticated organizations to further improve accuracy

Return Address Stack: Works on return instructions, simple and effective (not to be discussed more)
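To see why a 2-bit state can beat the 1-bit state, the sketch below compares the two on a loop branch. It assumes the common saturating-counter encoding (states 0 to 3, predict Taken when the state is 2 or 3); the slides do not specify the encoding, and the function names are illustrative.

```python
def predict_1bit(outcomes, state=0):
    """Count mispredictions of a 1-bit predictor (state 1 = predict Taken)."""
    misses = 0
    for taken in outcomes:
        if (state == 1) != taken:
            misses += 1
        state = 1 if taken else 0
    return misses, state


def predict_2bit(outcomes, state=0):
    """Count mispredictions of a 2-bit saturating counter (Taken if state >= 2)."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses, state


# A loop branch that is taken 9 times and then falls through; the loop runs 3 times.
outcomes = ([True] * 9 + [False]) * 3
print(predict_1bit(outcomes), predict_2bit(outcomes))
```

After warm-up, the 1-bit predictor misses twice per loop execution (the exit and the re-entry), while the 2-bit counter misses only on the exit because a single not-taken outcome does not flip its prediction.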

Frontend: Register Renaming
Consider two loop iterations: they conflict on register usage, so they cannot be executed in parallel as written, even though their work is mostly independent

LW   R8, 0(R4)     ; load X[i]; R4 holds &X[i]
ADD  R8, R8, R6    ; X[i] = X[i] + b
SW   R8, 0(R4)     ; store X[i]
ADDI R4, R4, 4     ; next element
SLT  R9, R4, R5    ; end of array?
BNE  R9, R0, Loop  ;
LW   R8, 0(R4)     ; load X[i]; R4 holds &X[i]
ADD  R8, R8, R6    ; X[i] = X[i] + b
SW   R8, 0(R4)     ; store X[i]
ADDI R4, R4, 4     ; next element
SLT  R9, R4, R5    ; end of array?
BNE  R9, R0, Loop  ;

Frontend: Register Renaming
Rename architectural registers to physical registers; remove false dependences and keep true dependences

LW   P32, 0(P4)      ; load X[i]; R4 holds &X[i]
ADD  P33, P32, P6    ; X[i] = X[i] + b
SW   P33, 0(P4)      ; store X[i]
ADDI P34, P4, 4      ; next element
SLT  P35, P34, P5    ; end of array?
BNE  P35, P0, Loop   ;
LW   P36, 0(P34)     ; load X[i]; R4 holds &X[i]
ADD  P37, P36, P6    ; X[i] = X[i] + b
SW   P37, 0(P34)     ; store X[i]
ADDI P38, P34, 4     ; next element
SLT  P39, P38, P5    ; end of array?
BNE  P39, P0, Loop   ;

Frontend: Register Renaming
How the design works:
• There is a register mapping table that maps architectural registers to physical registers
• There is a queue of free physical registers
• Every instruction with an output register is assigned an unused, free physical register
• Another mapping table is used to recover from a mis-predicted path
• There are a number of design variants in real processors
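The mapping table and free queue described above can be sketched as follows. This is a simplified model under stated assumptions: 32 architectural registers initially mapped to P0..P31, physical registers P32 onward free, immediates omitted, and no register freeing or recovery table. The tuple encoding and the name rename are illustrative.

```python
from collections import deque


def rename(program, num_arch=32, num_phys=64):
    """Rename (op, dst, srcs) tuples; registers are ints, dst may be None."""
    map_table = {r: r for r in range(num_arch)}    # Rn -> Pn initially
    free_list = deque(range(num_arch, num_phys))   # P32, P33, ... are free
    renamed = []
    for op, dst, srcs in program:
        # Read sources first so ADD R8, R8, R6 still sees the old mapping of R8.
        phys_srcs = tuple(map_table[s] for s in srcs)
        phys_dst = None
        if dst is not None:
            phys_dst = free_list.popleft()         # take an unused physical register
            map_table[dst] = phys_dst
        renamed.append((op, phys_dst, phys_srcs))
    return renamed, map_table


iteration = [
    ("LW",   8,    (4,)),
    ("ADD",  8,    (8, 6)),
    ("SW",   None, (8, 4)),
    ("ADDI", 4,    (4,)),
    ("SLT",  9,    (4, 5)),
    ("BNE",  None, (9, 0)),
]
renamed, mapping = rename(iteration * 2)
for op, dst, srcs in renamed:
    print(op, dst, srcs)
```

Run on two copies of the loop body, the second LW reads its address from the physical register written by the first ADDI, so the true dependence is kept while the reuse of R8 and R9 no longer serializes the iterations.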

Frontend: Register Renaming
The roles of register renaming:
• Remove register name dependence, keep true data dependence, so that more instructions can be safely reordered
• Help the backend implement speculative execution, as junk instructions cannot affect the input of good instructions
  • A younger instruction writes to a newly assigned physical register, so it cannot affect the input of older instructions
  • A good instruction is always older than any junk instruction

Backend: Out-Of-Order Scheduling
Common Design: Issue Queue

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -

Backend: Out-Of-Order Scheduling
Schedule: Select ready instructions, broadcast their tag (dst) to all other instructions for matching

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1    <= selected
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -    <= selected
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -

Backend: Out-Of-Order Scheduling
After LW and ADDI are issued, assume no new instructions

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
ADD   yes    P33  P32   yes     P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
--    no     --   --    --      --    --      --   -
SLT   yes    P35  P34   yes     P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -

Backend: Out-Of-Order Scheduling
After ADD and SLT are issued, assume no new instructions

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
--    no     --   --    --      --    --      --   -
SW    yes    --   P33   yes     P4    yes     3    2
--    no     --   --    --      --    --      --   -
--    no     --   --    --      --    --      --   -
BNE   yes    --   P35   yes     P0    yes     6    -

Backend: Out-Of-Order Scheduling
How the design works
• Instructions are sent to the issue queue after renaming
• A select logic chooses up to N instructions, all dependence free, to be executed
• The tags of the selected instructions are broadcast to all other queue entries
• A wakeup logic clears the dependence of other instructions on the selected instructions

Two major design variants: Issue Queue vs. Reservation Station
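The select and wakeup steps can be sketched as a small loop over the issue queue. This is an illustrative model only: it assumes every instruction completes in one cycle (so a broadcast tag is usable in the next cycle), an issue width of 2, and oldest-first selection. The function name schedule and the tuple layout are assumptions, not part of the slides.

```python
def schedule(queue, ready_regs, width=2):
    """Issue instructions cycle by cycle; return the ops issued in each cycle."""
    ready = set(ready_regs)
    pending = list(queue)                  # entries: (op, dst, srcs), oldest first
    cycles = []
    while pending:
        # Select: up to `width` oldest entries whose sources are all ready.
        selected = [e for e in pending if all(s in ready for s in e[2])][:width]
        if not selected:
            raise RuntimeError("no entry can ever become ready")
        cycles.append([op for op, _, _ in selected])
        # Wakeup: broadcast destination tags so dependents match them next cycle.
        ready.update(dst for _, dst, _ in selected if dst is not None)
        pending = [e for e in pending if e not in selected]
    return cycles


queue = [
    ("LW",   "P32", ("P4",)),
    ("ADD",  "P33", ("P32", "P6")),
    ("SW",   None,  ("P33", "P4")),
    ("ADDI", "P34", ("P4",)),
    ("SLT",  "P35", ("P34", "P5")),
    ("BNE",  None,  ("P35", "P0")),
]
issued = schedule(queue, ready_regs={"P0", "P4", "P5", "P6"})
print(issued)
```

With the six renamed instructions of one loop iteration, the model issues LW and ADDI first, then ADD and SLT, then SW and BNE, the same order as the issue-queue slides. A real load takes longer than one cycle, so actual timing would differ.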

Backend: Register Read, Data Forwarding and Writeback

Note: In reservation-station design, register-read happens before instruction scheduling

[Figure: pipeline stages Issue (scheduling), Reg-Read, Execute, Writeback. The Issue Queue feeds the Register File; a Forwarding Network connects to the functional units: Load/Store, Int, Mult, Div, Other.]


Reorder Buffer and In-Order Commit
Instructions enter and leave ROB in program order

“Architectural Register State” changes in program order

Junk instructions may produce values, but their values never appear in the “Architectural Register State”
• Junk instructions will be flushed upon detection

[Figure: three snapshots of the reorder buffer as a queue with head and tail pointers, showing entries being allocated and freed.]


Reorder Buffer
[Figure: fields of a reorder buffer entry]
• Dest arch reg
• Dest phy reg
• Exceptions?
• Program Counter
• Branch or L/W?
• Ready?
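In-order commit can be sketched with a queue whose head is the oldest instruction. The model below keeps only a name and a ready flag per entry (a subset of the fields listed above); the class and method names are hypothetical, and flush_after assumes the given entry is still in the buffer.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()              # head = left end, tail = right end

    def allocate(self, name):
        entry = {"name": name, "ready": False}
        self.entries.append(entry)          # allocate at the tail, in program order
        return entry

    def commit(self):
        """Retire ready entries from the head; stop at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            committed.append(self.entries.popleft()["name"])
        return committed

    def flush_after(self, entry):
        """Drop every entry younger than `entry`, e.g. a mis-predicted branch."""
        flushed = []
        while self.entries and self.entries[-1] is not entry:
            flushed.append(self.entries.pop()["name"])
        return flushed[::-1]


rob = ReorderBuffer()
lw, addi, bne, junk = (rob.allocate(n) for n in ("LW", "ADDI", "BNE", "junk"))
addi["ready"] = True                # ADDI finishes first, out of order
first = rob.commit()                # nothing retires: LW at the head is not done
lw["ready"] = True
second = rob.commit()               # LW and ADDI retire together, in order
flushed = rob.flush_after(bne)      # the junk instruction never commits
print(first, second, flushed)
```

The example shows the two properties stated on the slide: a finished instruction waits behind an unfinished older one, and entries behind a mis-predicted branch are discarded before they can change architectural state.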

Recall the Renaming Example
Consider two loop iterations: Rename architectural registers to physical registers; remove false dependences and keep true dependences

LW   P32, 0(P4)      ; load X[i]; R4 holds &X[i]
ADD  P33, P32, P6    ; X[i] = X[i] + b
SW   P33, 0(P4)      ; store X[i]
ADDI P34, P4, 4      ; next element
SLT  P35, P34, P5    ; end of array?
BNE  P35, P0, Loop   ;
LW   P36, 0(P34)     ; load X[i]; R4 holds &X[i]
ADD  P37, P36, P6    ; X[i] = X[i] + b
SW   P37, 0(P34)     ; store X[i]
ADDI P38, P34, 4     ; next element
SLT  P39, P38, P5    ; end of array?
BNE  P39, P0, Loop   ;

Architectural Register State

LW   R8, 0(R4)
ADD  R8, R8, R6
SW   R8, 0(R4)
ADDI R4, R4, 4
SLT  R9, R4, R5
BNE  R9, R0, Loop
LW   R8, 0(R4)
ADD  R8, R8, R6
SW   R8, 0(R4)
ADDI R4, R4, 4
SLT  R9, R4, R5
BNE  R9, R0, Loop

[Figure label: Mis-predicted path]

Snapshot 1                      R0  R4   R5  R6  R8   R9
architectural register mapping  P0  P4   P5  P6  P8   P9
speculative register mapping    P0  P4   P5  P6  P8   P9

Snapshot 2                      R0  R4   R5  R6  R8   R9
architectural register mapping  P0  P4   P5  P6  P8   P9
speculative register mapping    P0  P34  P5  P6  P33  P35

Snapshot 3                      R0  R4   R5  R6  R8   R9
architectural register mapping  P0  P34  P5  P6  P33  P35
speculative register mapping    P0  P38  P5  P6  P37  P39
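The relation between the two mappings can be sketched as follows. This is one possible recovery scheme, assumed here for illustration: the speculative mapping is updated at rename, the architectural mapping at commit, and recovery copies the architectural mapping back after the pipeline has been flushed. Real processors use several variants. The class RenameMaps is hypothetical, and only the final mapping of each register per iteration is tracked (the intermediate R8 -> P32 and R8 -> P36 steps are skipped).

```python
class RenameMaps:
    def __init__(self, initial):
        self.architectural = dict(initial)   # updated when instructions commit
        self.speculative = dict(initial)     # updated when instructions are renamed

    def rename(self, arch_reg, phys_reg):
        self.speculative[arch_reg] = phys_reg

    def commit(self, arch_reg, phys_reg):
        self.architectural[arch_reg] = phys_reg

    def recover(self):
        """Discard speculative mappings once junk instructions are flushed."""
        self.speculative = dict(self.architectural)


maps = RenameMaps({"R4": "P4", "R8": "P8", "R9": "P9"})
first = {"R8": "P33", "R4": "P34", "R9": "P35"}    # first iteration
second = {"R8": "P37", "R4": "P38", "R9": "P39"}   # second iteration

for reg, phys in first.items():
    maps.rename(reg, phys)
for reg, phys in first.items():
    maps.commit(reg, phys)
for reg, phys in second.items():
    maps.rename(reg, phys)
snapshot = dict(maps.speculative)    # speculative state before recovery
maps.recover()                       # treat the second iteration as junk
print(snapshot, maps.speculative)
```

Before recovery the two mappings match the third snapshot above; after recovery the speculative mapping falls back to the committed first-iteration registers, so values written by the junk path are never read.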



Summary
What we have learned
• In-order frontend vs. out-of-order backend
• Branch prediction to keep instruction flow
• Register renaming to remove name dependence and support speculative execution
• Out-of-order scheduling with issue queue
• In-order commit with reorder buffer

What we haven’t learned yet
• Memory disambiguation using load queue and store queue
• Details in complex real processors
