SuperScalar Design Prime

SUPERSCALAR DESIGN PRIME Zhao Zhang CprE 381, Computer Organization and Assembly-Level Programming, Fall 2012 Original slides from CprE 581, Advanced Computer Architecture


Transcript of SuperScalar Design Prime

Page 1: SuperScalar  Design Prime

SUPERSCALAR DESIGN PRIME
Zhao Zhang
CprE 381, Computer Organization and Assembly-Level Programming, Fall 2012
Original slides from CprE 581, Advanced Computer Architecture

Page 2: SuperScalar  Design Prime

History of Superscalar Design
First appearance in 1960s
• Scoreboarding
• Tomasulo Algorithm

Popular use since 1990s
• SGI MIPS processors
• Sun UltraSPARC
• DEC Alpha 21x64 series
• Intel/AMD processors

Now appearing in embedded processors
• Cortex-A9: Two-way, limited out-of-order
• Cortex-A15: Three-way, close to Intel/AMD design

Page 3: SuperScalar  Design Prime

Why Superscalar
Get more performance than a scalar pipeline

Superscalar techniques:
• Deep pipeline
• Multi-issue
• Branch prediction
• Register renaming
• Out-of-order execution
• Speculative execution
• Memory disambiguation

Page 4: SuperScalar  Design Prime

Code Example
for (i = 0; i < 1000; i++) X[i] = X[i] + b;

; loop body, initialization not shown
; R4: &X[i], R5: &X[1000] (i.e. X + 1000*4), R6: b
Loop: LW   R8, R4($0)   ; load X[i], R4 holds &X[i]
      ADD  R8, R8, R6   ; X[i] = X[i] + b
      SW   R8, R4($0)   ; store X[i]
      ADDI R4, R4, 4    ; next element
      SLT  R9, R4, R5   ; R9 = (R4 < R5)
      BNE  R9, R0, Loop ; end of loop?

Page 5: SuperScalar  Design Prime

Frontend and Backend
Frontend: In-order fetch, decode, and rename
Backend: Out-of-order issue, execute/writeback, in-order commit

Frontend may send “junk” instructions to the backend
• Junk instructions occur with branch mis-prediction or exceptions
• Design goal: Minimize the percentage of “junk” instructions

Backend must be able to detect and handle “junk” instructions
• Flush junk instructions upon detection
• In-order commit (retire) so that junk instructions won’t affect the “architectural state”
• Dozens of cycles likely for handling a branch mis-prediction

Page 6: SuperScalar  Design Prime

Frontend and Backend
[Figure: pipeline diagram with the frontend and backend regions marked. Source: “Cortex-A9 Processor Microarchitecture”, slide 6]

Page 7: SuperScalar  Design Prime

The Multi-Issue Factor
Multi-issue affects all pipeline stages: In the same cycle,
• N inst. are fetched: Usually from one I-cache block
• N inst. are decoded: Multiple decoders
• N inst. are renamed: Multi-ported renaming table, detecting intra-group dependence

In the backend
• Up to N inst. are scheduled: Multi-ported queue with broadcast
• N inst. read register file: Multi-ported register file
• M inst. are executed at functional units: Multiple functional units
• N inst. write back register values: Multi-ported register file
• N inst. are committed: Multi-banked reorder buffer, also involves rename table

Note: “N” is not necessarily the same value across pipeline stages

Page 8: SuperScalar  Design Prime

Frontend: Branch Prediction
Branch prediction is critical to reducing “junk” instructions

Poor branch prediction is a “disaster” for performance:
• SPECint programs have on average ~15% branches
• Every 100 instructions contain 15 branches
• Assume 10% mis-prediction => 1.5 branch mis-predictions
• Assume 20-cycle mis-prediction penalty => 30 lost cycles
• Assume IPC=3.0 => 33.3 cycles for executing 100 inst.
• 90% loss for the 10% mis-prediction: 30 lost cycles on top of 33.3 useful cycles
• Mis-prediction penalty is workload-dependent, and can be significantly longer than 20 cycles
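The arithmetic on this slide can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and parameters are made up for illustration, and the model simply adds the mis-prediction penalty cycles to the ideal execution time.

```python
def misprediction_overhead(branch_fraction, mispredict_rate, penalty_cycles,
                           ipc, instructions=100):
    """Return (mispredictions, lost_cycles, useful_cycles, overhead)."""
    mispredictions = instructions * branch_fraction * mispredict_rate
    lost_cycles = mispredictions * penalty_cycles
    useful_cycles = instructions / ipc        # cycles if every prediction were right
    overhead = lost_cycles / useful_cycles    # extra time relative to the ideal case
    return mispredictions, lost_cycles, useful_cycles, overhead


# Numbers from the slide: 15% branches, 10% mis-prediction, 20-cycle penalty, IPC = 3.0
mis, lost, useful, overhead = misprediction_overhead(0.15, 0.10, 20, 3.0)
print(f"{mis:.1f} mis-predictions, {lost:.0f} lost cycles, "
      f"{useful:.1f} useful cycles, {overhead:.0%} overhead")
print(f"effective IPC: {100 / (useful + lost):.2f}")
```

Under these assumptions the effective IPC drops from 3.0 to roughly 1.58, which is another way to read the 90% loss.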

Page 9: SuperScalar  Design Prime

Frontend: Branch Prediction
Branch prediction is made every cycle
• Otherwise, instruction flow stops
• It’s done in parallel with instruction fetch

The backend sends back feedback about past predictions

[Figure: single-cycle loop in which Pred-PC drives the Inst. Cache (output: INST) and the target, branch, and return addr. predictors, with feedback from the backend]

Page 10: SuperScalar  Design Prime

Frontend: Branch Prediction
Three components in a simple design
• Branch Target Buffer (BTB): What’s the branch target?
• Branch History Table (BHT): Is the branch taken or not?
• Return Address Stack (RAS)
  • Function return is a special type of branch instruction
  • There are multiple valid branch targets for the return

How BTB and BHT work in general
• Bet that the same patterns will repeat
• Use only PC and past branch outcome history in the prediction

Page 11: SuperScalar  Design Prime

Frontend: Branch Prediction
Branch Target Buffer with combined Branch History Table

[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field of the table; each entry also holds a Predicted PC and extra prediction state bits (see later)]
• Yes: instruction is a branch; use the predicted PC as the next PC
• No: branch not predicted, proceed normally (Next PC = PC+4)

From slides of CprE 581 Computer Systems Architecture

Page 12: SuperScalar  Design Prime

Frontend: Branch Prediction

First time fetching at BNE: Predicted as Not Taken
Loop: LW   R8, R4($0)   ; load X[i], R4 holds &X[i]
      ADD  R8, R8, R6   ; X[i] = X[i] + b
      SW   R8, R4($0)   ; store X[i]
      ADDI R4, R4, 4    ; next element
      SLT  R9, R4, R5   ; end of array?
      BNE  R9, R0, Loop ; => mis-prediction on 1st fetch

Table contents and prediction for each fetched instruction:

Inst.  Branch PC  Predicted PC  State  Prediction
LW     --         --            0      NT, right
ADD    --         --            0      NT, right
SW     --         --            0      NT, right
ADDI   --         --            0      NT, right
SLT    --         --            0      NT, right
BNE    --         --            0      NT, WRONG

Page 13: SuperScalar  Design Prime

Frontend: Branch Prediction

What happens after the mis-prediction
1. The frontend starts fetching junk instructions, probably dozens of them
2. The backend detects the mis-prediction, flushes the backend pipeline, and notifies the frontend about the mis-predicted branch
3. The frontend updates the BTB/BHT, filling in BNE-PC and LW-PC, and changes the prediction state bit
4. The frontend restarts fetching from LW-PC

Inst.  Branch PC  Predicted PC  State
LW     --         --            0
ADD    --         --            0
SW     --         --            0
ADDI   --         --            0
SLT    --         --            0
BNE    BNE-PC     LW-PC         1

Page 14: SuperScalar  Design Prime

Frontend: Branch Prediction

2nd time fetching at BNE: Predicted as Taken, jump to LW-PC
Loop: LW   R8, R4($0)   ; load X[i], R4 holds &X[i]
      ADD  R8, R8, R6   ; X[i] = X[i] + b
      SW   R8, R4($0)   ; store X[i]
      ADDI R4, R4, 4    ; next element
      SLT  R9, R4, R5   ; end of array?
  =>  BNE  R9, R0, Loop ;

Inst.  Branch PC  Predicted PC  State  Prediction
LW     --         --            0      NT, right
ADD    --         --            0      NT, right
SW     --         --            0      NT, right
ADDI   --         --            0      NT, right
SLT    --         --            0      NT, right
BNE    BNE-PC     LW-PC         1      Taken, RIGHT

Page 15: SuperScalar  Design Prime

Frontend: Branch Prediction

Last time fetching at BNE-PC, predicted as Taken
• It’s wrong because the loop will exit

This time, the prediction state bit is changed to 0
• Next time the prediction outcome on BNE-PC is Not Taken

Inst.  Branch PC  Predicted PC  State
LW     --         --            0
ADD    --         --            0
SW     --         --            0
ADDI   --         --            0
SLT    --         --            0
BNE    BNE-PC     LW-PC         0

Page 16: SuperScalar  Design Prime

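Branch Prediction State Bit

General form:
1. Access the prediction state with the PC
2. Predict: output T/NT
3. Feedback: update the state with the actual outcome T/NT

1-bit prediction:
• State 1 (Predict Taken): stays at 1 on T, moves to 0 on NT
• State 0 (Predict Not Taken): stays at 0 on NT, moves to 1 on T

From CprE 581, Computer Systems Architecture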
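The loop walkthrough on the preceding pages can be replayed with a small simulation of a combined BTB/BHT entry and a 1-bit state. This is a minimal sketch, not the actual hardware organization: the table is a Python dict, and the addresses for BNE-PC and LW-PC are hypothetical values chosen only for illustration.

```python
def run_loop_branch(iterations, bne_pc=0x400014, lw_pc=0x400000):
    """Replay the loop-closing BNE through a 1-bit BTB/BHT entry."""
    btb = {}            # branch PC -> [predicted PC, state bit]
    mispredicted = []   # 1-based iteration numbers where the prediction was wrong
    for i in range(1, iterations + 1):
        actual_taken = i < iterations          # BNE is taken except on loop exit
        entry = btb.get(bne_pc)
        # A miss in the table, or state bit 0, means "predict Not Taken".
        predicted_taken = entry is not None and entry[1] == 1
        if predicted_taken != actual_taken:
            mispredicted.append(i)
        # Feedback from the backend: record the target and the last outcome.
        btb[bne_pc] = [lw_pc, 1 if actual_taken else 0]
    return mispredicted, btb


mispredicted, btb = run_loop_branch(1000)
print(mispredicted)   # [1, 1000]
```

For the 1000-iteration loop, only the first and the last fetch of BNE are mis-predicted, and the state bit ends at 0, matching the slides.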

Page 17: SuperScalar  Design Prime

Branch History Table
Branch direction prediction is usually more challenging
• BHT can be separated from BTB (often the case)
• 2-bit or 3-bit states are usually used
• BHT can be organized in two levels to predict on correlation between branches
• BHT can have sophisticated organizations to further improve accuracy

Return Address Stack: Works on return instructions, simple and effective (not to be discussed more)
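The slide does not say which 2-bit scheme is meant. The sketch below assumes the common saturating-counter scheme (predict Taken when the counter is in its upper half) and compares it with the 1-bit state on a loop branch that is executed repeatedly. The workload, a 1000-iteration loop entered 10 times, is a made-up example.

```python
def count_mispredictions(outcomes, bits):
    """Count wrong predictions of an n-bit saturating counter starting at 0."""
    max_count = (1 << bits) - 1     # e.g. 3 for a 2-bit counter
    threshold = 1 << (bits - 1)     # predict Taken when counter >= threshold
    counter = 0
    wrong = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            wrong += 1
        # Feedback: move toward Taken or Not Taken, saturating at the ends.
        counter = min(counter + 1, max_count) if taken else max(counter - 1, 0)
    return wrong


# Loop branch: taken 999 times, then not taken; the whole loop runs 10 times.
outcomes = ([True] * 999 + [False]) * 10
print(count_mispredictions(outcomes, bits=1))   # 20
print(count_mispredictions(outcomes, bits=2))   # 12
```

With one bit, each loop exit flips the state, so the next entry into the loop is mis-predicted as well. With two bits, a single Not Taken outcome does not flip the prediction, so after warm-up only the loop exit is mis-predicted.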

Page 18: SuperScalar  Design Prime

Frontend: Register Renaming
Consider two loop iterations: They conflict on register usage, so they cannot be executed in parallel as written, even though they are mostly independent

LW   R8, R4($0)   ; load X[i], R4 holds &X[i]
ADD  R8, R8, R6   ; X[i] = X[i] + b
SW   R8, R4($0)   ; store X[i]
ADDI R4, R4, 4    ; next element
SLT  R9, R4, R5   ; end of array?
BNE  R9, R0, Loop ;
LW   R8, R4($0)   ; load X[i], R4 holds &X[i]
ADD  R8, R8, R6   ; X[i] = X[i] + b
SW   R8, R4($0)   ; store X[i]
ADDI R4, R4, 4    ; next element
SLT  R9, R4, R5   ; end of array?
BNE  R9, R0, Loop ;

Page 19: SuperScalar  Design Prime

Frontend: Register Renaming
Rename architectural registers to physical registers, remove false dependence and keep true dependence

LW   P32, P4($0)   ; load X[i], P4 holds &X[i]
ADD  P33, P32, P6  ; X[i] = X[i] + b
SW   P33, P4($0)   ; store X[i]
ADDI P34, P4, 4    ; next element
SLT  P35, P34, P5  ; end of array?
BNE  P35, P0, Loop ;
LW   P36, P34($0)  ; load X[i], P34 holds &X[i]
ADD  P37, P36, P6  ; X[i] = X[i] + b
SW   P37, P34($0)  ; store X[i]
ADDI P38, P34, 4   ; next element
SLT  P39, P38, P5  ; end of array?
BNE  P39, P0, Loop ;

Page 20: SuperScalar  Design Prime

Frontend: Register Renaming
How the design works:
• There is a register mapping table that maps architectural registers to physical registers
• There is a queue of free physical registers
• Every instruction with an output register is assigned an unused, free physical register
• Another mapping table is used to recover from a mis-predicted path
• There are a number of design variants in real processors
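The mapping table and free queue described above can be sketched as follows. This is a simplified illustration under stated assumptions: the initial mapping is Ri -> Pi, the free queue starts at P32 (as in the renaming example), the physical register count of 64 is arbitrary, and freeing registers at commit is not modeled. The tuple encoding of instructions is invented for this sketch.

```python
from collections import deque


def rename(program, num_arch=32, num_phys=64):
    """Rename (op, dst, srcs) tuples; dst is None for SW and BNE."""
    table = {f"R{i}": f"P{i}" for i in range(num_arch)}        # mapping table
    free = deque(f"P{i}" for i in range(num_arch, num_phys))   # free queue
    renamed = []
    for op, dst, srcs in program:
        # Look up sources first, so ADD R8, R8, R6 reads the old mapping of R8.
        new_srcs = [table[s] for s in srcs]
        new_dst = None
        if dst is not None:
            new_dst = free.popleft()    # assign an unused physical register
            table[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed, table


iteration = [
    ("LW", "R8", ["R4"]),
    ("ADD", "R8", ["R8", "R6"]),
    ("SW", None, ["R8", "R4"]),
    ("ADDI", "R4", ["R4"]),
    ("SLT", "R9", ["R4", "R5"]),
    ("BNE", None, ["R9", "R0"]),
]
renamed, table = rename(iteration * 2)   # two loop iterations
for op, dst, srcs in renamed:
    print(op, dst, srcs)
```

After two iterations the table maps R4, R8, and R9 to P38, P37, and P39, and every source still names the physical register written by its true producer.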

Page 21: SuperScalar  Design Prime

Frontend: Register Renaming
The roles of register renaming:
• Remove register name dependence, keep true data dependence, so that more instructions can be safely reordered
• Help the backend implement speculative execution, as junk instructions cannot affect the input of good instructions
  • A younger instruction writes to a newly assigned physical register, so it cannot affect the input of older instructions
  • A good instruction is always older than any junk instruction

Page 22: SuperScalar  Design Prime

Backend: Out-Of-Order Scheduling
Common Design: Issue Queue

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -

Page 23: SuperScalar  Design Prime

Backend: Out-Of-Order Scheduling
Schedule: Select ready instructions, broadcast their tag (dst) to all other instructions for matching

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -

LW and ADDI have both sources ready, so they are selected and their tags (P32, P34) are broadcast

Page 24: SuperScalar  Design Prime

Backend: Out-Of-Order Scheduling
After LW and ADDI are issued, assume no new instructions

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
ADD   yes    P33  P32   yes     P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
--    no     --   --    --      --    --      --   -
SLT   yes    P35  P34   yes     P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -

Page 25: SuperScalar  Design Prime

Backend: Out-Of-Order Scheduling
After ADD and SLT are issued, assume no new instructions

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
--    no     --   --    --      --    --      --   -
SW    yes    --   P33   yes     P4    yes     3    2
--    no     --   --    --      --    --      --   -
--    no     --   --    --      --    --      --   -
BNE   yes    --   P35   yes     P0    yes     6    -

Page 26: SuperScalar  Design Prime

Backend: Out-Of-Order Scheduling
How the design works
• Instructions are sent to the issue queue after renaming
• A select logic chooses up to N instructions, all dependence free, to be executed
• The tags of the selected instructions are broadcast to all other queue entries
• A wakeup logic clears the dependence of other instructions on the selected instructions

Two major design variants: Issue Queue vs. Reservation Station
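The select and wakeup steps can be illustrated with a toy issue-queue model. It is a sketch under strong assumptions: selection is oldest-first, every instruction (including the load) makes its result visible to dependents in the next cycle, and no new instructions enter the queue. The `Entry` class and `schedule` function are names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Entry:
    op: str
    dst: Optional[str]   # destination tag (physical register), None if no output
    waiting: Set[str]    # source tags that are not ready yet


def schedule(queue: List[Entry], width: int) -> List[List[str]]:
    """Return the ops issued in each cycle, issuing at most `width` per cycle."""
    queue = list(queue)
    cycles = []
    while queue:
        # Select: oldest-first, up to `width` entries with no pending sources.
        selected = [e for e in queue if not e.waiting][:width]
        if not selected:
            raise RuntimeError("no ready instruction: a source tag is never produced")
        queue = [e for e in queue if not any(e is s for s in selected)]
        # Wakeup: broadcast destination tags and clear matching dependences.
        tags = {e.dst for e in selected if e.dst is not None}
        for e in queue:
            e.waiting -= tags
        cycles.append([e.op for e in selected])
    return cycles


# One renamed loop iteration; only the sources marked "no" in the table are waiting.
queue = [
    Entry("LW", "P32", set()),
    Entry("ADD", "P33", {"P32"}),
    Entry("SW", None, {"P33"}),
    Entry("ADDI", "P34", set()),
    Entry("SLT", "P35", {"P34"}),
    Entry("BNE", None, {"P35"}),
]
print(schedule(queue, width=2))
# [['LW', 'ADDI'], ['ADD', 'SLT'], ['SW', 'BNE']]
```

The issue order reproduces the walkthrough on the preceding pages: LW and ADDI first, then ADD and SLT, then SW and BNE.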

Page 27: SuperScalar  Design Prime

Backend: Register Read, Data Forwarding and Writeback

Note: In reservation-station design, register-read happens before instruction scheduling

[Figure: pipeline stages Issue (scheduling), Reg-Read, Execute, Writeback. The Issue Queue feeds the Register File, which feeds the functional units (Load/Store, Int, Mult, Div, Other) through a Forwarding Network]

Page 28: SuperScalar  Design Prime

Reorder Buffer and In-Order Commit
[Figure: three snapshots of the reorder buffer with head and tail pointers; entries are marked allocated or freed]

Page 29: SuperScalar  Design Prime

Reorder Buffer and In-Order Commit
Instructions enter and leave the ROB in program order

“Architectural Register State” changes in program order

Junk instructions may produce values, but their values never appear in the “Architectural Register State”
• Junk instructions will be flushed upon detection

Reorder Buffer entry fields:
• Dest arch reg
• Dest phy reg
• Exceptions?
• Program Counter
• Branch or L/W?
• Ready?
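A queue with a head and a tail is enough to show why commit is in order even when execution is not. The sketch below is illustrative only: each entry carries just a name and a Ready flag, the class and method names are invented, and flushing is modeled as dropping every entry younger than a given one.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # head is entries[0], tail is entries[-1]

    def allocate(self, name):
        """Enter the ROB at the tail, in program order."""
        entry = {"name": name, "ready": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire from the head only; stop at the first entry that is not ready."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            committed.append(self.entries.popleft()["name"])
        return committed

    def flush_after(self, entry):
        """Drop every entry younger than `entry`, e.g. after a mis-predicted branch."""
        flushed = []
        while self.entries and self.entries[-1] is not entry:
            flushed.append(self.entries.pop()["name"])
        return flushed[::-1]


rob = ReorderBuffer()
lw, add, sw, addi, slt, bne = (rob.allocate(n) for n in
                               ["LW", "ADD", "SW", "ADDI", "SLT", "BNE"])
addi["ready"] = True
print(rob.commit())        # [] because LW at the head is not ready yet
lw["ready"] = True
print(rob.commit())        # ['LW']; ADDI must still wait behind ADD and SW
rob.allocate("junk LW")
rob.allocate("junk ADD")
print(rob.flush_after(bne))   # ['junk LW', 'junk ADD']
```

ADDI finishes early but cannot leave the ROB until ADD and SW ahead of it have committed, and the junk entries fetched past BNE are removed without ever reaching the head.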

Page 30: SuperScalar  Design Prime

Recall the Renaming Example
Consider two loop iterations: Rename architectural registers to physical registers, remove false dependence and keep true dependence

LW   P32, P4($0)   ; load X[i], P4 holds &X[i]
ADD  P33, P32, P6  ; X[i] = X[i] + b
SW   P33, P4($0)   ; store X[i]
ADDI P34, P4, 4    ; next element
SLT  P35, P34, P5  ; end of array?
BNE  P35, P0, Loop ;
LW   P36, P34($0)  ; load X[i], P34 holds &X[i]
ADD  P37, P36, P6  ; X[i] = X[i] + b
SW   P37, P34($0)  ; store X[i]
ADDI P38, P34, 4   ; next element
SLT  P39, P38, P5  ; end of array?
BNE  P39, P0, Loop ;

Page 31: SuperScalar  Design Prime

Architectural Register State
Instruction stream (two loop iterations; the figure marks a mis-predicted path):

LW   R8, R4($0)
ADD  R8, R8, R6
SW   R8, R4($0)
ADDI R4, R4, 4
SLT  R9, R4, R5
BNE  R9, R0, Loop
LW   R8, R4($0)
ADD  R8, R8, R6
SW   R8, R4($0)
ADDI R4, R4, 4
SLT  R9, R4, R5
BNE  R9, R0, Loop

Register mappings at three points in the figure:

                                     R0  R4   R5  R6  R8   R9
1. architectural register mapping    P0  P4   P5  P6  P8   P9
   speculative register mapping      P0  P4   P5  P6  P8   P9
2. architectural register mapping    P0  P4   P5  P6  P8   P9
   speculative register mapping      P0  P34  P5  P6  P33  P35
3. architectural register mapping    P0  P34  P5  P6  P33  P35
   speculative register mapping      P0  P38  P5  P6  P37  P39
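The relation between the two mappings can be sketched in code. The recovery policy here is an assumption made for illustration, not something the slides spell out: the speculative mapping is updated at rename, the architectural mapping at commit, and recovery copies the architectural mapping over the speculative one once all instructions older than the mis-predicted path have committed. The class and method names are invented.

```python
class RenameMaps:
    def __init__(self, arch_regs):
        # Initial state: Ri -> Pi in both tables.
        self.arch = {r: "P" + r[1:] for r in arch_regs}
        self.spec = dict(self.arch)

    def rename_dest(self, reg, phys):
        """Frontend: a renamed instruction writes `reg`, now held in `phys`."""
        self.spec[reg] = phys

    def commit_dest(self, reg, phys):
        """Backend: the instruction commits, so the mapping becomes architectural."""
        self.arch[reg] = phys

    def recover(self):
        """Mis-prediction: discard speculative mappings made by junk instructions."""
        self.spec = dict(self.arch)


maps = RenameMaps(["R0", "R4", "R5", "R6", "R8", "R9"])
first = [("R8", "P32"), ("R8", "P33"), ("R4", "P34"), ("R9", "P35")]
second = [("R8", "P36"), ("R8", "P37"), ("R4", "P38"), ("R9", "P39")]

for reg, phys in first + second:    # both iterations are renamed
    maps.rename_dest(reg, phys)
for reg, phys in first:             # only the first iteration commits
    maps.commit_dest(reg, phys)

print(maps.arch)    # R4 -> P34, R8 -> P33, R9 -> P35
print(maps.spec)    # R4 -> P38, R8 -> P37, R9 -> P39
maps.recover()      # treat the second iteration as the mis-predicted path
print(maps.spec == maps.arch)
```

Before recovery the two tables match the third pair of mappings in the figure; after recovery the speculative mapping falls back to the committed state.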

Page 32: SuperScalar  Design Prime

Summary
What we have learned
• In-order frontend vs. out-of-order backend
• Branch prediction to keep instruction flow
• Register renaming to remove name dependence and support speculative execution
• Out-of-order scheduling with issue queue
• In-order commit with re-order buffer

What we haven’t learned yet
• Memory disambiguation using load queue and store queue
• Details of complex real processors