Advanced Topic: High Performance Processors


Transcript of Advanced Topic: High Performance Processors

Page 1: Advanced Topic: High Performance Processors

Savio Chau

Advanced Topic:High Performance Processors

CS M151B Spring 02

Page 2: Advanced Topic: High Performance Processors

Savio Chau

High Performance Processor Design Techniques

Main Idea: Exploit as much parallelism and hide as much overhead as possible

• Instruction Level Parallelism
  – Scoreboarding
  – Reservation Stations (Tomasulo Algorithm)
  – Dynamic Branch Prediction
  – Speculation Architecture
  – Multiple Instruction Issue (Superscalar Processors)

• Vector Processors

• Digital Signal Processors

Page 3: Advanced Topic: High Performance Processors

Savio Chau

Hardware Approach to Instruction Parallelism

• Why in hardware at run time?
  – Works when we can't know the real dependences at compile time
  – Compiler is simpler
  – Code for one machine runs well on another

• Key idea: Allow instructions behind a stall to proceed

  DIVD F0, F2, F4
  ADDD F10, F0, F8
  SUBD F12, F8, F14

  – Enables out-of-order execution => out-of-order completion
  – ID stage checked both for structural and data hazards

Page 4: Advanced Topic: High Performance Processors

Savio Chau

Three Generic Data Hazards

• Many high-performance processors have multiple execution units and execute multiple instructions at the same time. These processors have to handle three types of data hazards (a small sketch follows this list):

• Read After Write (RAW): InstrI followed by InstrJ; InstrJ tries to read an operand before InstrI writes it

• Write After Read (WAR): InstrI followed by InstrJ; InstrJ tries to write an operand before InstrI reads it
  – Gets the wrong operand
  – Can't happen in the simple 5-stage pipeline because:
    • All instructions take 5 stages, and
    • Reads are always in stage 2, and
    • Writes are always in stage 5

• Write After Write (WAW): InstrI followed by InstrJ; InstrJ tries to write an operand before InstrI writes it
  – Leaves the wrong result (InstrI's, not InstrJ's)
  – Can't happen in the DLX 5-stage pipeline because:
    • All instructions take 5 stages, and
    • Writes are always in stage 5
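For illustration only, a minimal Python sketch of how these three hazard classes can be recognized between an earlier instruction I and a later instruction J from their destination and source registers. The Instr tuple and the example register values are assumptions made for this sketch, not part of the lecture.

    # Minimal sketch: classify RAW / WAR / WAW hazards between an earlier
    # instruction I and a later instruction J (illustrative only).
    from collections import namedtuple

    Instr = namedtuple("Instr", ["op", "dest", "srcs"])

    def hazards(i, j):
        """Return the set of hazard types that J creates with respect to I."""
        found = set()
        if i.dest is not None and i.dest in j.srcs:
            found.add("RAW")          # J reads what I writes
        if j.dest is not None and j.dest in i.srcs:
            found.add("WAR")          # J writes what I still needs to read
        if i.dest is not None and i.dest == j.dest:
            found.add("WAW")          # J writes the same destination as I
        return found

    divd = Instr("DIVD", "F0", ("F2", "F4"))
    addd = Instr("ADDD", "F10", ("F0", "F8"))
    subd = Instr("SUBD", "F8", ("F8", "F14"))
    print(hazards(divd, addd))        # {'RAW'}  (ADDD reads F0 produced by DIVD)
    print(hazards(addd, subd))        # {'WAR'}  (SUBD writes F8 that ADDD reads)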

Page 5: Advanced Topic: High Performance Processors

Savio Chau

Scoreboarding

• Scoreboard dates to the CDC 6600 in 1963
• Out-of-order execution divides the ID stage:

1. Issue—decode instructions, check for structural hazards

2. Read operands—wait until no data hazards, then read operands

• Scoreboards allow instruction to execute whenever 1 & 2 hold, not waiting for prior instructions

• CDC 6600: In order issue, out of order execution, out of order commit (also called completion)

Page 6: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Implications

• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
  – Queue both the operation and copies of its operands
  – Read registers only during the Read Operands stage
• For WAW, must detect the hazard: stall issue until the other instruction completes
• Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies and the state of operations
• Scoreboard replaces ID, EX, WB with 4 stages

Page 7: Advanced Topic: High Performance Processors

Savio Chau

Four Stages of Scoreboard Control

1. Issue—decode instructions & check for structural hazards (ID1) If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.

2. Read operands—wait until no data hazards, then read operands (ID2) A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

Page 8: Advanced Topic: High Performance Processors

Savio Chau

Four Stages of Scoreboard Control

3. Execution—operate on operands (EX): The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.

4. Write result—finish execution (WB): Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there are none, it writes the result. If there is a WAR hazard, it stalls the instruction.

Example:

DIVD F0,F2,F4

ADDD F10,F0,F8

SUBD F8,F8,F14

CDC 6600 scoreboard would stall SUBD until ADDD reads operands

Page 9: Advanced Topic: High Performance Processors

Savio Chau

Three Parts of the Scoreboard

1. Instruction status—which of 4 steps the instruction is in

2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit

Busy—Indicates whether the unit is busy or not

Op—Operation to perform in the unit (e.g., + or –)

Fi—Destination register

Fj, Fk—Source-register numbers

Qj, Qk—Functional units producing source registers Fj, Fk

Rj, Rk—Flags indicating when Fj, Fk are ready

3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
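A sketch of these three structures in Python follows; the field names track the slide, while the class layout and unit names are illustrative assumptions.

    # Sketch of the three scoreboard data structures described above.
    # Field names follow the slide; the Python layout is only illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class FUStatus:
        busy: bool = False     # Busy - unit in use?
        op: str = None         # Op  - operation to perform (e.g. ADDD)
        fi: str = None         # Fi  - destination register
        fj: str = None         # Fj  - first source register
        fk: str = None         # Fk  - second source register
        qj: str = None         # Qj  - FU producing Fj (None if none)
        qk: str = None         # Qk  - FU producing Fk
        rj: bool = False       # Rj  - Fj ready to be read?
        rk: bool = False       # Rk  - Fk ready to be read?

    @dataclass
    class Scoreboard:
        # 1. Instruction status: which of the 4 steps each instruction is in
        instr_status: dict = field(default_factory=dict)
        # 2. Functional unit status: one FUStatus per unit
        fu: dict = field(default_factory=lambda: {
            name: FUStatus() for name in
            ("Integer", "Mult1", "Mult2", "Add", "Divide")})
        # 3. Register result status: register -> FU that will write it
        result: dict = field(default_factory=dict)

    sb = Scoreboard()
    print(sb.fu["Add"].busy)   # False - all units start idle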

Page 10: Advanced Topic: High Performance Processors

Savio Chau

Detailed Scoreboard Pipeline Control

Issue
  Wait until: Not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU)←Yes; Op(FU)←op; Fi(FU)←D; Fj(FU)←S1; Fk(FU)←S2; Qj←Result(S1); Qk←Result(S2); Rj←not Qj; Rk←not Qk; Result(D)←FU

Read operands
  Wait until: Rj and Rk
  Bookkeeping: Rj←No; Rk←No

Execution complete
  Wait until: Functional unit done
  Bookkeeping: (none)

Write result
  Wait until: ∀f ((Fj(f)≠Fi(FU) or Rj(f)=No) and (Fk(f)≠Fi(FU) or Rk(f)=No))
  Bookkeeping: ∀f (if Qj(f)=FU then Rj(f)←Yes); ∀f (if Qk(f)=FU then Rk(f)←Yes); Result(Fi(FU))←0; Busy(FU)←No
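The same wait conditions and bookkeeping, restated as a small Python sketch over plain dictionaries. The data layout (fu maps a unit name to its Busy/Op/Fi/Fj/Fk/Qj/Qk/Rj/Rk fields, result maps a register to the unit that will write it) is an assumption for the example, not the CDC 6600 hardware.

    def can_issue(fu, result, unit, dest):
        # Issue: unit free and no other FU already writing the destination (WAW)
        return not fu[unit]["Busy"] and dest not in result

    def do_issue(fu, result, unit, op, dest, s1, s2):
        fu[unit].update(Busy=True, Op=op, Fi=dest, Fj=s1, Fk=s2,
                        Qj=result.get(s1), Qk=result.get(s2))
        fu[unit]["Rj"] = fu[unit]["Qj"] is None
        fu[unit]["Rk"] = fu[unit]["Qk"] is None
        result[dest] = unit

    def can_read_operands(fu, unit):
        # Read operands: both sources ready (RAW resolved)
        return fu[unit]["Rj"] and fu[unit]["Rk"]

    def can_write_result(fu, unit):
        # Write result: no other unit still needs the old value of Fi (WAR)
        fi = fu[unit]["Fi"]
        return all((f["Fj"] != fi or not f["Rj"]) and
                   (f["Fk"] != fi or not f["Rk"])
                   for name, f in fu.items() if name != unit)

    def do_write_result(fu, result, unit):
        # Wake up units waiting on this FU, then free it
        for f in fu.values():
            if f["Qj"] == unit:
                f["Rj"] = True
            if f["Qk"] == unit:
                f["Rk"] = True
        result.pop(fu[unit]["Fi"], None)
        fu[unit]["Busy"] = False

    fu = {n: dict(Busy=False, Op=None, Fi=None, Fj=None, Fk=None,
                  Qj=None, Qk=None, Rj=False, Rk=False)
          for n in ("Integer", "Add", "Divide")}
    result = {}
    if can_issue(fu, result, "Add", "F8"):
        do_issue(fu, result, "Add", "SUBD", "F8", "F6", "F2")
    print(fu["Add"]["Busy"], result)   # True {'F8': 'Add'}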

Page 11: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example

Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F6, 34+R2    -  -  -  -
  LD    F2, 45+R3    -  -  -  -
  MULTD F0, F2, F4   -  -  -  -
  SUBD  F8, F6, F2   -  -  -  -
  DIVD  F10, F0, F6  -  -  -  -
  ADDD  F6, F8, F2   -  -  -  -

Functional unit status (Time, Name, Busy, Op, Fi dest, Fj/Fk sources, Qj/Qk FUs for j and k, Rj/Rk ready flags):
  Integer: not busy    Mult1: not busy    Mult2: not busy    Add: not busy    Divide: not busy

Register result status (Clock 0): F0 F2 F4 F6 F8 F10 F12 ... F30; no functional unit is pending to write any register

Page 12: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 1

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU Integer

Page 13: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 2

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Integer

• Issue 2nd LD?

Page 14: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 3

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Integer

• Issue MULT?

Page 15: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 4

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Integer

Page 16: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 5

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Integer

Page 17: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 6

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6MULTDF0 F2 F4 6SUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Mult1 Integer

Page 18: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 7

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7MULTDF0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Mult1 Integer Add

• Read multiply operands?

Page 19: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 8a

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7MULTDF0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6 8ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 Integer Add Divide

Page 20: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 8b

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6 8ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 Yes Mult F0 F2 F4 Yes YesMult2 NoAdd Yes Sub F8 F6 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 Add Divide

Page 21: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 9

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

10 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 No

2 Add Yes Sub F8 F6 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F309 FU Mult1 Add Divide

• Read operands for MULT & SUBD? Issue ADDD?

Page 22: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 11

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9 11DIVD F10 F0 F6 8ADDDF6 F8 F2Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

8 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 No

0 Add Yes Sub F8 F6 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 Add Divide

Page 23: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 13

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDDF6 F8 F2 13Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

6 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU Mult1 Add Divide

Page 24: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 14

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDDF6 F8 F2 13 14Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

5 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 No

2 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3014 FU Mult1 Add Divide

Page 25: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 15

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDDF6 F8 F2 13 14Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

4 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 No

1 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 Add Divide

Page 26: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 16

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDDF6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

3 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 No

0 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU Mult1 Add Divide

Page 27: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 17

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDDF6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

2 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3017 FU Mult1 Add Divide

• Write result of ADDD?

Page 28: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 18

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDDF6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No

1 Mult1 Yes Mult F0 F2 F4 Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3018 FU Mult1 Add Divide

Page 29: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 20

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDDF6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Yes Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3020 FU Add Divide

Page 30: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 21

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDDF6 F8 F2 13 14 16Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Yes Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3021 FU Add Divide

Page 31: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 22

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDDF6 F8 F2 13 14 16 22Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd No

40 Divide Yes Div F10 F0 F6 Yes YesRegister result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3022 FU Divide

Page 32: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 61

Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F6, 34+R2    1   2   3   4
  LD    F2, 45+R3    5   6   7   8
  MULTD F0, F2, F4   6   9  19  20
  SUBD  F8, F6, F2   7   9  11  12
  DIVD  F10, F0, F6  8  21  61  62
  ADDD  F6, F8, F2  13  14  16  22

Functional unit status: Integer, Mult1, Mult2, Add, and Divide are all free (Divide time remaining: 0)

Register result status (Clock 62): no functional unit is pending to write any register

Page 33: Advanced Topic: High Performance Processors

Savio Chau

Scoreboard Example Cycle 62

Instruction status Read ExecutionWriteInstruction j k Issue operandscompleteResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTDF0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61 62ADDDF6 F8 F2 13 14 16 22Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd No

0 Divide NoRegister result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3062 FU

Page 34: Advanced Topic: High Performance Processors

Savio Chau

CDC 6600 Scoreboard

• Speedup of 1.7 from the compiler and 2.5 by hand, BUT slow memory (no cache) limits the benefit
• Limitations of the 6600 scoreboard:
  – No forwarding hardware
  – Limited to instructions in a basic block (small window)
  – Small number of functional units (structural hazards), especially integer/load-store units
  – Does not issue on structural hazards
  – Waits for WAR hazards
  – Prevents WAW hazards

Page 35: Advanced Topic: High Performance Processors

Savio Chau

Another Dynamic Approach: Tomasulo Algorithm

• Built for the IBM 360/91, about 3 years after the CDC 6600 (1966)
• Goal: high performance without special compilers
• Differences between the IBM 360 and CDC 6600 ISAs:
  – IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  – IBM has 4 FP registers vs. 8 in the CDC 6600
• Why study it? It led to the Alpha 21264, HP 8000, MIPS R10000, Pentium II, PowerPC 604, …

Page 36: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Algorithm vs. Scoreboard

• Control and buffers are distributed with the functional units (FUs) vs. centralized in the scoreboard
  – FU buffers are called "reservation stations"; they hold pending operands
• Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
  – Avoids WAR, WAW hazards
  – More reservation stations than registers, so it can do optimizations compilers can't
• Results go from the RSs to the FUs, not through registers, over a Common Data Bus that broadcasts results to all FUs
• Loads and stores are treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue

Page 37: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Organization

[Figure: Tomasulo organization. The FP Op Queue and FP Registers feed the FP Add and FP Mul reservation stations; Load and Store Buffers handle memory; the Common Data Bus broadcasts results to all of them.]

Page 38: Advanced Topic: High Performance Processors

Savio Chau

Reservation Station Components

Op—Operation to perform in the unit (e.g., + or –)

Vj, Vk—Values of the source operands
  – Store buffers have a V field: the result to be stored

Qj, Qk—Reservation stations producing the source operands (the values to be written)
  – Note: no ready flags as in the scoreboard; Qj, Qk = 0 => ready
  – Store buffers only have Qi, for the RS producing the result

Busy—Indicates the reservation station or FU is busy

Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
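As a sketch, these fields can be pictured as a small Python structure (names follow the slide; the layout and example values are assumptions):

    # Sketch of the reservation-station fields listed above (illustrative).
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReservationStation:
        name: str
        busy: bool = False
        op: Optional[str] = None    # operation to perform (e.g. ADDD)
        vj: Optional[float] = None  # Vj - value of first operand, if known
        vk: Optional[float] = None  # Vk - value of second operand, if known
        qj: Optional[str] = None    # Qj - RS that will produce Vj (None => ready)
        qk: Optional[str] = None    # Qk - RS that will produce Vk

        def ready(self):
            # No ready flags needed: Qj = Qk = None means both operands present
            return self.busy and self.qj is None and self.qk is None

    # Register result status: register -> RS that will write it (blank if none)
    register_status = {}

    rs = ReservationStation("Add1", busy=True, op="SUBD", qj=None, qk="Load2")
    print(rs.ready())   # False - still waiting on Load2 to supply Vk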

Page 39: Advanced Topic: High Performance Processors

Savio Chau

Three Stages of Tomasulo Algorithm

1. Issue—get instruction from the FP Op Queue. If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming the registers).

2. Execution—operate on operands (EX). When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.

3. Write result—finish execution (WB). Write on the Common Data Bus to all awaiting units; mark the reservation station available.

• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
  – 64 bits of data + 4 bits of functional-unit source address
  – Write if it matches the expected functional unit (the one producing the result)
  – Does the broadcast
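A minimal sketch of the three steps for one ALU instruction, with register renaming at issue and a broadcast on write result. The dictionaries, station names, and register values are assumptions for the example, not the IBM 360/91 design.

    stations = {n: dict(busy=False, op=None, vj=None, vk=None, qj=None, qk=None)
                for n in ("Add1", "Add2", "Add3", "Mult1", "Mult2")}
    regs = {r: 0.0 for r in ("F0", "F2", "F4", "F6", "F8", "F10")}
    reg_status = {}          # register -> name of RS that will produce it

    def issue(rs_name, op, dest, s1, s2):
        """Step 1: copy available operands, rename the rest to RS names."""
        rs = stations[rs_name]
        rs.update(busy=True, op=op)
        rs["vj"], rs["qj"] = (None, reg_status[s1]) if s1 in reg_status else (regs[s1], None)
        rs["vk"], rs["qk"] = (None, reg_status[s2]) if s2 in reg_status else (regs[s2], None)
        reg_status[dest] = rs_name        # register renaming: dest now names this RS

    def ready(rs_name):
        """Step 2: execute only when both operands have arrived (Qj = Qk = empty)."""
        rs = stations[rs_name]
        return rs["busy"] and rs["qj"] is None and rs["qk"] is None

    def write_result(rs_name, value):
        """Step 3: broadcast (source tag, value) on the CDB to RSs and registers."""
        for rs in stations.values():
            if rs["qj"] == rs_name:
                rs["vj"], rs["qj"] = value, None
            if rs["qk"] == rs_name:
                rs["vk"], rs["qk"] = value, None
        for reg, tag in list(reg_status.items()):
            if tag == rs_name:
                regs[reg] = value
                del reg_status[reg]
        stations[rs_name]["busy"] = False

    issue("Add1", "SUBD", "F8", "F6", "F2")
    print(ready("Add1"))                 # True - F6 and F2 values were copied at issue
    write_result("Add1", regs["F6"] - regs["F2"])
    print(regs["F8"], reg_status)        # 0.0 {}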

Page 40: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 0

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTDF0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No0 Add3 No0 Mult1 No0 Mult2 No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F300 FU

Page 41: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 1

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 Load1 No 34+R2LD F2 45+ R3 Load2 NoMULTDF0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No0 Mult1 No0 Mult2 No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU Load1

Yes

Page 42: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 2

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTDF0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No0 Mult1 No0 Mult2 No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Load2 Load1

Note: Unlike 6600, can have multiple loads outstanding

Page 43: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 3

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDDF6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No0 Mult1 Yes MULTD R(F4) Load20 Mult2 No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Mult1 Load2 Load1

• Note: register names are removed ("renamed") in the Reservation Stations; MULT issued, unlike the scoreboard

• Load1 completing; what is waiting for Load1?

Page 44: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 4

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDDF6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 Yes SUBD M(34+R2) Load20 Add2 No

Add3 No0 Mult1 Yes MULTD R(F4) Load20 Mult2 No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Mult1 Load2 M(34+R2) Add1

• Load2 completing; what is waiting for it?

Page 45: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 5

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDDF6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk2 Add1 Yes SUBD M(34+R2) M(45+R3)0 Add2 No

Add3 No10 Mult1 Yes MULTD M(45+R3) R(F4)

0 Mult2 Yes DIVD M(34+R2) Mult1Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Mult1 M(45+R3) M(34+R2) Add1 Mult2

Page 46: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 6

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDDF6 F8 F2 6Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk1 Add1 Yes SUBD M(34+R2) M(45+R3)0 Add2 Yes ADDD M(45+R3) Add1

Add3 No9 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Mult1 M(45+R3) Add2 Add1 Mult2

• Issue ADDD here vs. scoreboard?

Page 47: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 7

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDDF6 F8 F2 6Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 Yes SUBD M(34+R2) M(45+R3)0 Add2 Yes ADDD M(45+R3) Add1

Add3 No8 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Mult1 M(45+R3) Add2 Add1 Mult2

• Add1 completing; what is waiting for it?

Page 48: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 8

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No 2 Add2 Yes ADDD M()-M() M(45+R3)0 Add3 No7 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 M(45+R3) Add2 M()-M() Mult2

Page 49: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 9

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDDF6 F8 F2 6Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No1 Add2 Yes ADDD M()–M() M(45+R3)0 Add3 No6 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F309 FU Mult1 M(45+R3) Add2 M()–M() Mult2

Page 50: Advanced Topic: High Performance Processors

Savio Chau

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDDF6 F8 F2 6 10Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 Yes ADDD M()–M() M(45+R3)0 Add3 No5 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3010 FU Mult1 M(45+R3) Add2 M()–M() Mult2

Tomasulo Example Cycle 10

• Add2 completing; what is waiting for it?

Page 51: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 11

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No0 Add3 No4 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 M(45+R3) (M-M)+M() M()–M() Mult2

• Write result of ADDD here vs. scoreboard?

Page 52: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 12

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 6 7DIVD F10 F0 F6 5ADDDF6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No0 Add3 No3 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3012 FU Mult1 M(45+R3) (M-M)+M() M()–M() Mult2

• Note: all quick instructions complete already

Page 53: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 13

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDDF6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No2 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU Mult1 M(45+R3) (M–M)+M() M()–M() Mult2

Page 54: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 14

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDDF6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No0 Add3 No1 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3014 FU Mult1 M(45+R3) (M–M)+M() M()–M() Mult2

Page 55: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 15

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDDF6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No0 Mult1 Yes MULTD M(45+R3) R(F4)0 Mult2 Yes DIVD M(34+R2) Mult1

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 M(45+R3) (M–M)+M() M()–M() Mult2

• Mult1 completing; what is waiting for it?

Page 56: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 16

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDDF6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No0 Mult1 No

40 Mult2 Yes DIVD M*F4 M(34+R2)Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU M*F4 M(45+R3) (M–M)+M() M()–M() Mult2

• Note: Just waiting for divide

Page 57: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 55

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDDF6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No0 Mult1 No1 Mult2 Yes DIVD M*F4 M(34+R2)

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3055 FU M*F4 M(45+R3) (M–M)+M() M()–M() Mult2

Page 58: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 56

Instruction status Execution WriteInstruction j k Issue complete Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTDF0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDDF6 F8 F2 6 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk0 Add1 No0 Add2 No

Add3 No0 Mult1 No0 Mult2 Yes DIVD M*F4 M(34+R2)

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(45+R3) (M–M)+M() M()–M() Mult2

• Mult 2 completing; what is waiting for it?

Page 59: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Example Cycle 57

Instruction status (Issue / Execution complete / Write result):
  LD    F6, 34+R2    1   3   4
  LD    F2, 45+R3    2   4   5
  MULTD F0, F2, F4   3  15  16
  SUBD  F8, F6, F2   4   7   8
  DIVD  F10, F0, F6  5  56  57
  ADDD  F6, F8, F2   6  10  11

Load buffers: Load1, Load2, Load3 all free
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free

Register result status / values (Clock 57): F0 = M*F4, F2 = M(45+R3), F6 = (M–M)+M(), F8 = M()–M(), F10 = M*F4/M

• Again: in-order issue, out-of-order execution, out-of-order completion

Page 60: Advanced Topic: High Performance Processors

Savio Chau

Compare to Scoreboard Cycle 62

Instruction status (Issue / Read operands / Execution complete / Write result):
  LD    F6, 34+R2    1   2   3   4
  LD    F2, 45+R3    5   6   7   8
  MULTD F0, F2, F4   6   9  19  20
  SUBD  F8, F6, F2   7   9  11  12
  DIVD  F10, F0, F6  8  21  61  62
  ADDD  F6, F8, F2  13  14  16  22

Functional unit status: Integer, Mult1, Mult2, Add, Divide all free
Register result status (Clock 62): no pending writes

• Why does it take longer on the scoreboard/CDC 6600?

Page 61: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)

                      IBM 360/91 (Tomasulo)                         CDC 6600 (Scoreboard)
Functional units      Pipelined (6 load, 3 store, 3 +, 2 x/÷)       Multiple (1 load/store, 1 +, 2 x, 1 ÷)
Window size           ≤ 14 instructions                             ≤ 5 instructions
Structural hazard     No issue on structural hazard                 same
WAR                   Renaming avoids them                          Stall completion
WAW                   Renaming avoids them                          Stall issue until the earlier write completes
Results               Broadcast from FU over the Common Data Bus    Written to / read from the registers
Control               Distributed reservation stations              Central scoreboard

Page 62: Advanced Topic: High Performance Processors

Savio Chau

Tomasulo Drawbacks

• Complexity
  – Delays of the 360/91, MIPS R10000, IBM 620?
• Many associative stores (CDB) at high speed
• Performance limited by the Common Data Bus
  – Multiple CDBs => more FU logic for parallel associative stores

Page 63: Advanced Topic: High Performance Processors

Savio Chau

Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: the lower bits of the PC address index a table of 1-bit values
  – Says whether or not the branch was taken last time
  – No address check
• Problem: in a loop, a 1-bit BHT will cause two mispredictions (average is 9 iterations before exit):
  – The end-of-loop case, when it exits instead of looping as before
  – The first time through the loop on the next pass through the code, when it predicts exit instead of looping

Page 64: Advanced Topic: High Performance Processors

Savio Chau

Dynamic Branch Prediction

• Solution: 2-bit scheme where the prediction is changed only if it mispredicts twice (Figure 4.13, p. 264)
  – Red: stop, not taken
  – Green: go, taken

[Figure 4.13: 2-bit predictor state machine. States 00 and 01 predict taken; states 10 and 11 predict not taken; taken and not-taken outcomes move between the states, so the prediction changes only after two mispredictions in a row.]
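For illustration, a sketch of the 2-bit scheme as a table of saturating counters indexed by the low PC bits. The table size and the counter encoding are assumptions (one common implementation), not necessarily the exact state assignment in Figure 4.13.

    BHT_BITS = 12                       # 4096 entries, as on the next slide
    bht = [0] * (1 << BHT_BITS)         # 0,1 = predict taken; 2,3 = predict not taken

    def predict(pc):
        return bht[pc & ((1 << BHT_BITS) - 1)] < 2      # True => predict taken

    def update(pc, taken):
        i = pc & ((1 << BHT_BITS) - 1)
        if taken:
            bht[i] = max(bht[i] - 1, 0)  # move toward "strongly taken"
        else:
            bht[i] = min(bht[i] + 1, 3)  # move toward "strongly not taken"

    # A loop branch taken 9 times then not taken mispredicts only the exit,
    # and still predicts taken on the next entry to the loop:
    pc = 0x400
    wrong = 0
    for taken in [True] * 9 + [False]:
        if predict(pc) != taken:
            wrong += 1
        update(pc, taken)
    print(wrong)    # 1 misprediction per pass (the loop exit)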

Page 65: Advanced Topic: High Performance Processors

Savio Chau

Branch History Table Accuracy

• Mispredict because either:
  – Wrong guess for that branch
  – Got the branch history of the wrong branch when indexing the table
• With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• 4096 entries is about as good as an infinite table (in the Alpha 21164)

Page 66: Advanced Topic: High Performance Processors

Savio Chau

Correlating Branches

• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
• In general, an (m,n) predictor means recording the last m branches to select between 2^m history tables, each with n-bit counters
  – The old 2-bit BHT is then a (0,2) predictor (a sketch of a (2,2) predictor follows)
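A sketch of an (m, n) predictor with m = 2 and n = 2; the table size, the simple modulo indexing, and the single global history register are assumptions made for the example.

    M, N, ENTRIES = 2, 2, 1024
    tables = [[0] * ENTRIES for _ in range(1 << M)]   # 2^m tables of n-bit counters
    global_history = 0                                # last m outcomes, as m bits

    def predict(pc):
        counter = tables[global_history][pc % ENTRIES]
        return counter >= (1 << (N - 1))              # upper half => predict taken

    def update(pc, taken):
        global global_history
        t = tables[global_history]
        i = pc % ENTRIES
        t[i] = min(t[i] + 1, (1 << N) - 1) if taken else max(t[i] - 1, 0)
        global_history = ((global_history << 1) | int(taken)) & ((1 << M) - 1)

    # A branch that alternates taken / not-taken becomes predictable once the
    # 2-bit history distinguishes the two contexts:
    pc = 0x800
    for k in range(20):
        update(pc, k % 2 == 0)
    print(predict(pc))   # True - history "taken, not taken" implies taken next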

Page 67: Advanced Topic: High Performance Processors

Savio Chau

Correlating Branches

• (2,2) predictor
  – The behavior of the most recent branches selects between, say, four predictions for the next branch, updating just that prediction

[Figure: (2,2) correlating predictor; 2 bits of global branch history select one of four per-branch 2-bit predictions.]

Page 68: Advanced Topic: High Performance Processors

Savio Chau

Selective History Predictor

[Figure: Selective history predictor. The branch address indexes an 8K x 2-bit selector that chooses between a non-correlating predictor (8096 x 2 bits, indexed by branch address) and a correlating predictor (2048 x 4 x 2 bits, selected by 2 bits of global history); the chosen 2-bit counter supplies the taken / not-taken prediction.]
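A sketch of that selector idea in Python; the sizes follow the figure loosely, and the update policy (strengthen the selector toward whichever predictor was right) is an assumption about one common design, not the exact hardware.

    SEL_ENTRIES, BHT_ENTRIES, CORR_ENTRIES, HIST_BITS = 8192, 8096, 2048, 2

    selector = [1] * SEL_ENTRIES            # 2-bit: >= 2 => trust the correlator
    simple_bht = [2] * BHT_ENTRIES          # 2-bit counters, >= 2 => taken
    correlator = [[2] * CORR_ENTRIES for _ in range(1 << HIST_BITS)]
    history = 0

    def predict(pc):
        simple = simple_bht[pc % BHT_ENTRIES] >= 2
        corr = correlator[history][pc % CORR_ENTRIES] >= 2
        return corr if selector[pc % SEL_ENTRIES] >= 2 else simple

    def update(pc, taken):
        global history
        simple = simple_bht[pc % BHT_ENTRIES] >= 2
        corr = correlator[history][pc % CORR_ENTRIES] >= 2
        s = pc % SEL_ENTRIES
        if corr == taken and simple != taken:      # correlator right, simple BHT wrong
            selector[s] = min(selector[s] + 1, 3)
        elif simple == taken and corr != taken:
            selector[s] = max(selector[s] - 1, 0)
        for table, idx in ((simple_bht, pc % BHT_ENTRIES),
                           (correlator[history], pc % CORR_ENTRIES)):
            table[idx] = min(table[idx] + 1, 3) if taken else max(table[idx] - 1, 0)
        history = ((history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

    print(predict(0x1234))   # initial prediction before any outcomes are seen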

Page 69: Advanced Topic: High Performance Processors

Savio Chau

[Figure 4.21, p. 272: Accuracy of different schemes. Frequency of mispredictions, ranging from 0% up to 18%, for three predictors (a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT) on the benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li.]

Page 70: Advanced Topic: High Performance Processors

Savio Chau

Need Address at Same Time as Prediction

• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  – Note: must check for a branch match now, since we can't use a wrong branch address (Figure 4.22, p. 273)
• Return instruction addresses are predicted with a stack

[Figure 4.22: Branch target buffer entry: predicted PC plus the branch prediction (taken or not taken).]
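For illustration, a sketch of a BTB lookup and update; the table size, indexing, and entry format are assumptions made for the example, not a specific processor's BTB.

    BTB_ENTRIES = 1024
    btb = [None] * BTB_ENTRIES      # each entry: (branch_pc, target_pc, taken?)

    def btb_lookup(pc):
        entry = btb[(pc >> 2) % BTB_ENTRIES]
        if entry is None or entry[0] != pc:     # must check the full branch address
            return pc + 4, False                # not a known branch: fall through
        _, target, taken = entry
        return (target if taken else pc + 4), taken

    def btb_update(pc, target, taken):
        btb[(pc >> 2) % BTB_ENTRIES] = (pc, target, taken)

    btb_update(0x4000, 0x3F00, True)
    print(hex(btb_lookup(0x4000)[0]))   # 0x3f00 - predicted target, known in fetch
    print(hex(btb_lookup(0x4004)[0]))   # 0x4008 - no BTB hit, predict fall-through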

Page 71: Advanced Topic: High Performance Processors

Savio Chau

Dynamic Branch Prediction Summary

• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
• Branch Target Buffer: include the branch target address and the prediction
• Predicated execution can reduce the number of branches and the number of mispredicted branches

Page 72: Advanced Topic: High Performance Processors

Savio Chau

Speculation

• Speculation: allow an instruction to execute without any consequences (including exceptions) if the branch is not actually taken ("HW undo"); called "boosting"

• Combine branch prediction with dynamic scheduling to execute before branches resolved

• Separate speculative bypassing of results from real bypassing of results
  – When an instruction is no longer speculative, write the boosted results (instruction commit) or discard them
  – Execute out-of-order but commit in-order, to prevent irrevocable actions (state updates or exceptions) until the instruction commits

Page 73: Advanced Topic: High Performance Processors

Savio Chau

Hardware Support for Speculation

• Need a HW buffer for the results of uncommitted instructions: the reorder buffer
  – 3 fields: instruction, destination, value
  – The reorder buffer can be an operand source => more registers, like the RSs
  – Use the reorder buffer number instead of the reservation station number when execution completes
  – Supplies operands between execution complete and commit
  – Once an instruction commits, its result is put into the register
  – Instructions commit in order; as a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: Speculative Tomasulo datapath. The FP Op Queue feeds reservation stations in front of the FP adders; results go to the Reorder Buffer, which updates the FP registers when instructions commit.]

Page 74: Advanced Topic: High Performance Processors

Savio Chau

Four Steps of Speculative Tomasulo Algorithm

1. Issue—get instruction from the FP Op Queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch").

2. Execution—operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; this checks RAW (sometimes called "issue").

3. Write result—finish execution (WB). Write on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available.

4. Commit—update the register with the reorder result. When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation").
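A simplified Python sketch of the commit step: a FIFO reorder buffer whose head alone may update architectural state, with a flush on a mispredicted branch. The entry fields are assumptions for the example, not the exact hardware structures.

    from collections import deque

    rob = deque()     # entries: dict(instr, dest, value, done, mispredicted_branch)
    regs = {}

    def rob_allocate(instr, dest):
        entry = dict(instr=instr, dest=dest, value=None, done=False,
                     mispredicted_branch=False)
        rob.append(entry)
        return entry

    def commit():
        """Retire the head entry only if its result is present."""
        if not rob or not rob[0]["done"]:
            return
        head = rob.popleft()
        if head["mispredicted_branch"]:
            rob.clear()                          # flush all speculative work
        elif head["dest"] is not None:
            regs[head["dest"]] = head["value"]   # architectural state updated here

    e1 = rob_allocate("ADDD F4,F0,F2", "F4")
    e2 = rob_allocate("SD 0(R1),F4", None)
    e2["done"] = True           # finished out of order...
    commit()
    print(regs)                 # {}  - still empty: e1 at the head is not done yet
    e1["done"], e1["value"] = True, 3.5
    commit(); commit()
    print(regs)                 # {'F4': 3.5} - now both commit, in order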

Page 75: Advanced Topic: High Performance Processors

Savio Chau

Renaming Registers

• Common variation of the speculative design
• The reorder buffer keeps the instruction information but not the result
• Extend the register file with extra renaming registers to hold speculative results
• A rename register is allocated at issue; the result goes into the rename register when execution completes; the rename register is copied into the real register at commit
• Operands are read either from the register file (real or speculative) or via the Common Data Bus
• Advantage: operands always come from a single source (the extended register file)

Page 76: Advanced Topic: High Performance Processors

Savio Chau

Issuing Multiple Instructions/Cycle

• Two variations
• Superscalar: varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
  – IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000
• (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; operations are put into wide templates
  – Joint HP/Intel agreement in 1999/2000?
  – Intel Architecture-64 (IA-64), 64-bit address
  – Style: "Explicitly Parallel Instruction Computer (EPIC)"
• Anticipated success led to the use of Instructions Per Clock cycle (IPC) vs. CPI

Page 77: Advanced Topic: High Performance Processors

Savio Chau

Issuing Multiple Instructions/Cycle

• Superscalar DLX: 2 instructions, 1 FP and 1 anything else
  – Fetch 64 bits/clock cycle; integer on the left, FP on the right
  – Can only issue the 2nd instruction if the 1st instruction issues
  – More ports on the FP registers to do an FP load and an FP op as a pair

  Type               1    2    3    4    5    6    7
  Int. instruction   IF   ID   EX   MEM  WB
  FP instruction     IF   ID   EX   MEM  WB
  Int. instruction        IF   ID   EX   MEM  WB
  FP instruction          IF   ID   EX   MEM  WB
  Int. instruction             IF   ID   EX   MEM  WB
  FP instruction               IF   ID   EX   MEM  WB

• A 1-cycle load delay expands to 3 instructions in the superscalar
  – The instruction in the right half of the pair can't use it, nor can the instructions in the next slot

Page 78: Advanced Topic: High Performance Processors

Savio Chau

Loop Unrolling in Superscalar

Integer instruction FP instruction Clock cycle

Loop: LD F0,0(R1) 1

LD F6,-8(R1) 2

LD F10,-16(R1) ADDD F4,F0,F2 3

LD F14,-24(R1) ADDD F8,F6,F2 4

LD F18,-32(R1) ADDD F12,F10,F2 5

SD 0(R1),F4 ADDD F16,F14,F2 6

SD -8(R1),F8 ADDD F20,F18,F2 7

SD -16(R1),F12 8

SD -24(R1),F16 9

SUBI R1,R1,#40 10

BNEZ R1,LOOP 11

SD -32(R1),F20 12

• Unrolled 5 times to avoid delays (+1 due to SS)
• 12 clocks, or 2.4 clocks per iteration (1.5X)

Page 79: Advanced Topic: High Performance Processors

Savio Chau

Multiple Issue Challenges

• While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  – Exactly 50% FP operations
  – No hazards
• If more instructions issue at the same time, decode and issue become harder
  – Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue
• VLIW: trade instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    • 16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide
  – Needs a compiling technique that schedules across several branches

Page 80: Advanced Topic: High Performance Processors

Savio Chau

Loop Unrolling in VLIW

Memory reference 1   Memory reference 2   FP operation 1   FP operation 2   Int. op/branch   Clock
LD F0,0(R1)          LD F6,-8(R1)                                                            1

LD F10,-16(R1) LD F14,-24(R1) 2

LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 3

LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4

ADDD F20,F18,F2 ADDD F24,F22,F2 5

SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6

SD -16(R1),F12 SD -24(R1),F16 7

SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 8

SD -0(R1),F28 BNEZ R1,LOOP 9

Unrolled 7 times to avoid delays

7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)

Average: 2.5 ops per clock, 50% efficiency

Note: Need more registers in VLIW (15 vs. 6 in SS)

Page 81: Advanced Topic: High Performance Processors

Savio Chau

Trace Scheduling

• Parallelism across IF branches vs. LOOP branches
• Two steps:
  – Trace Selection
    • Find a likely sequence of basic blocks (a trace) of (statically predicted or profile-predicted) long sequences of straight-line code
  – Trace Compaction
    • Squeeze the trace into a few VLIW instructions
    • Need bookkeeping code in case the prediction is wrong
• The compiler undoes bad guesses (discards values in registers)
• Subtle compiler bugs mean a wrong answer rather than just poorer performance; there are no hardware interlocks

Page 82: Advanced Topic: High Performance Processors

Savio Chau

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

• HW determines address conflicts
• HW better branch prediction
• HW maintains a precise exception model
• HW does not execute bookkeeping instructions
• Works across multiple implementations
• SW speculation is much easier for HW design

Page 83: Advanced Topic: High Performance Processors

Savio Chau

Superscalar v. VLIW

Superscalar
• Smaller code size
• Binary compatibility across generations of hardware

VLIW
• Simplified hardware for decoding and issuing instructions
• No interlock hardware (compiler checks?)
• More registers, but simplified hardware for register ports (multiple independent register files?)

Page 84: Advanced Topic: High Performance Processors

Savio Chau

Dynamic Scheduling in Superscalar

• Dependencies stop instruction issue
• Code compiled for an old version will run poorly on the newest version
  – May want code to vary depending on how superscalar the machine is
• How to issue two instructions and keep in-order instruction issue for Tomasulo?
  – Assume 1 integer + 1 floating point
  – 1 Tomasulo control for integer, 1 for floating point
• Issue at 2X the clock rate, so that issue remains in order
• Only FP loads might cause a dependency between integer and FP issue (a sketch of these checks follows this list):
  – Replace the load reservation station with a load queue; operands must be read in the order they are fetched
  – A load checks addresses in the Store Queue to avoid a RAW violation
  – A store checks addresses in the Load Queue to avoid WAR, WAW
  – Called a "decoupled architecture"
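A sketch of those two address checks over simple load and store queues; the queue representation and the sequence numbers are assumptions made for the example.

    store_queue = []    # pending stores: (seq_no, address)
    load_queue = []     # pending loads:  (seq_no, address)

    def load_must_wait(seq, addr):
        # RAW violation if an older, still-pending store targets the same address
        return any(s_seq < seq and s_addr == addr for s_seq, s_addr in store_queue)

    def store_must_wait(seq, addr):
        # WAR/WAW: an older pending load or store to the same address must finish first
        older_load = any(l_seq < seq and l_addr == addr for l_seq, l_addr in load_queue)
        older_store = any(s_seq < seq and s_addr == addr for s_seq, s_addr in store_queue)
        return older_load or older_store

    store_queue.append((1, 0x100))
    load_queue.append((2, 0x100))
    print(load_must_wait(2, 0x100))    # True  - must wait for store #1 (RAW)
    print(load_must_wait(3, 0x200))    # False - different address, may proceed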

Page 85: Advanced Topic: High Performance Processors

Savio Chau

Performance of Dynamic Superscalar Scheduling

Iteration no.   Instruction   Issues   Executes   Writes result   (clock-cycle numbers)

1 LD F0,0(R1) 1 2 4

1 ADDD F4,F0,F2 1 5 8

1 SD 0(R1),F4 2 9

1 SUBI R1,R1,#8 3 4 5

1 BNEZ R1,LOOP 4 5

2 LD F0,0(R1) 5 6 8

2 ADDD F4,F0,F2 5 9 12

2 SD 0(R1),F4 6 13

2 SUBI R1,R1,#8 7 8 9

2 BNEZ R1,LOOP 8 9

4 clocks per iteration; only 1 FP instr/iteration

Branches, Decrements issues still take 1 clock cycle

How to get more performance?

Page 86: Advanced Topic: High Performance Processors

Savio Chau

Software Pipelining

• Observation: if the iterations of a loop are independent, then we can get more ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)

[Figure: a software-pipelined iteration is assembled from instructions drawn from several consecutive iterations (iteration 0 through iteration 4) of the original loop.]

Page 87: Advanced Topic: High Performance Processors

Savio Chau

Software Pipelining Example

Before: unrolled 3 times
  1  LD    F0,0(R1)
  2  ADDD  F4,F0,F2
  3  SD    0(R1),F4
  4  LD    F6,-8(R1)
  5  ADDD  F8,F6,F2
  6  SD    -8(R1),F8
  7  LD    F10,-16(R1)
  8  ADDD  F12,F10,F2
  9  SD    -16(R1),F12
  10 SUBI  R1,R1,#24
  11 BNEZ  R1,LOOP

After: software pipelined
  1  SD    0(R1),F4     ; stores M[i]
  2  ADDD  F4,F0,F2     ; adds to M[i-1]
  3  LD    F0,-16(R1)   ; loads M[i-2]
  4  SUBI  R1,R1,#8
  5  BNEZ  R1,LOOP

• Symbolic loop unrolling
  – Maximize the result-use distance
  – Less code space than unrolling
  – Fill and drain the pipe only once per loop, vs. once per each unrolled iteration in loop unrolling

[Figure: software pipelining overlaps operations from successive iterations over time, filling and draining the pipeline once per loop, compared with loop unrolling, which fills and drains it for each unrolled body.]
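To make the schedule concrete, a small Python sketch of the same idea on an array update y[i] = x[i] + c: each pass of the kernel stores the result of one iteration, adds for the next, and loads for the one after that, with a prologue and an epilogue. The function and variable names are assumptions for the example, not the DLX code above.

    def add_constant_pipelined(x, c):
        n = len(x)
        if n < 3:
            return [xi + c for xi in x]
        y = [None] * n
        loaded = x[0]              # prologue: load iteration 0,
        added = loaded + c         #           add for iteration 0,
        loaded = x[1]              #           load iteration 1
        for i in range(2, n):
            y[i - 2] = added       # SD   : store result of iteration i-2
            added = loaded + c     # ADDD : add for iteration i-1
            loaded = x[i]          # LD   : load for iteration i
        y[n - 2] = added           # epilogue: drain the last two iterations
        y[n - 1] = loaded + c
        return y

    print(add_constant_pipelined([1, 2, 3, 4, 5], 10))   # [11, 12, 13, 14, 15]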

Page 88: Advanced Topic: High Performance Processors

Savio Chau

Limits to Multi-Issue Machines

• Inherent limitations of ILP
  – 1 branch in 5: how to keep a 5-way VLIW busy?
  – Latencies of units: many operations must be scheduled
  – Need about (pipeline depth x number of functional units) independent instructions to keep the machine busy
• Difficulties in building HW
  – Easy: more instruction bandwidth
  – Easy: duplicate FUs to get parallel execution
  – Hard: increase ports to the register file (bandwidth)
    • The VLIW example needs 7 read and 3 write ports for the integer registers, and 5 read and 3 write ports for the FP registers
  – Harder: increase ports to memory (bandwidth)
  – Decoding in a superscalar and its impact on clock rate, pipeline depth?

Page 89: Advanced Topic: High Performance Processors

Savio Chau

Limits to Multi-Issue Machines

• Limitations specific to either the superscalar or the VLIW implementation
  – Decode/issue in superscalar: how wide is practical?
  – VLIW code size: unrolled loops plus wasted fields in the VLIW
    • IA-64 compresses dependent instructions, but is still larger
  – VLIW lock step => 1 hazard and all instructions stall
    • IA-64 not lock step? Dynamic pipeline?
  – VLIW and binary compatibility
    • IA-64 promises binary compatibility

Page 90: Advanced Topic: High Performance Processors

Savio Chau

Limits to ILP

• Conflicting studies of the amount
  – Benchmarks (vectorized Fortran FP vs. integer C programs)
  – Hardware sophistication
  – Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?

Page 91: Advanced Topic: High Performance Processors

Savio Chau

Limits to ILP

Initial HW Model here; MIPS compilers.

Assumptions for ideal/perfect machine to start:

1. Register renaming–infinite virtual registers and all WAW & WAR hazards are avoided

2. Branch prediction–perfect; no mispredictions

3. Jump prediction–all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available

4. Memory-address alias analysis–addresses are known & a store can be moved before a load provided addresses not equal

1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle

Page 92: Advanced Topic: High Performance Processors

Savio Chau

Intel/HP “Explicitly Parallel Instruction Computer (EPIC)”

• 3 instructions in a 128-bit "group"; a field determines whether the instructions are dependent or independent
  – Smaller code size than old VLIW, larger than x86/RISC
  – Groups can be linked to show independence of more than 3 instructions
• 64 integer registers + 64 floating point registers
  – Not separate files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
• IA-64: name of the instruction set architecture; EPIC is the type
• Merced is the name of the first implementation (1999/2000?)
• LIW = EPIC?

Page 93: Advanced Topic: High Performance Processors

Savio Chau

Dynamic Scheduling in PowerPC 604 and Pentium Pro

• Both do in-order issue, out-of-order execution, and in-order commit
• The Pentium Pro is more like a scoreboard, since its control is central rather than distributed

Page 94: Advanced Topic: High Performance Processors

Savio Chau

Dynamic Scheduling in PowerPC 604 and Pentium Pro

Parameter PPC PPro

Max. instructions issued/clock 4 3

Max. instr. complete exec./clock 6 5

Max. instr. committed/clock 6 3

Window (Instrs in reorder buffer) 16 40

Number of reservations stations 12 20

Number of rename registers 8int/12FP 40

No. integer functional units (FUs) 2 2

No. floating point FUs 1 1

No. branch FUs 1 1

No. complex integer FUs 1 0

No. memory FUs 1 1 load + 1 store

Q: How pipeline 1 to 17 byte x86 instructions?

Page 95: Advanced Topic: High Performance Processors

Savio Chau

Dynamic Scheduling in Pentium Pro

• The PPro doesn't pipeline 80x86 instructions directly
• The PPro decode unit translates the Intel instructions into 72-bit micro-operations (similar to DLX instructions)
• It sends the micro-operations to the reorder buffer and reservation stations
• Takes 1 clock cycle to determine the length of an 80x86 instruction, plus 2 more to create the micro-operations
• 12-14 clocks in the total pipeline (3 state machines)
• Many instructions translate to 1 to 4 micro-operations
• Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

Page 96: Advanced Topic: High Performance Processors

Savio Chau

Problems with Instruction Level Parallelism

• Limits to conventional exploitation of ILP:

1) Pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)

2) Instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle

3) Cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality

Page 97: Advanced Topic: High Performance Processors

Savio Chau

Alternative Model: Vector Processing

SCALAR (1 operation):   add r3, r1, r2        ; r3 = r1 + r2

VECTOR (N operations):  add.vv v3, v1, v2     ; v3[i] = v1[i] + v2[i] for i = 0 .. vector length - 1

• Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

Page 98: Advanced Topic: High Performance Processors

Savio Chau

Properties of Vector Processors

• Each result is independent of the previous result
  => long pipeline, compiler ensures no dependencies
  => high clock rate
• Vector instructions access memory with a known pattern
  => highly interleaved memory
  => amortize memory latency over the 64 elements
  => no (data) caches required! (do use an instruction cache)
• Reduces branches and branch problems in pipelines
• A single vector instruction implies lots of work (a loop)
  => fewer instruction fetches

Page 99: Advanced Topic: High Performance Processors

Savio Chau

Styles of Vector Architectures

• Memory-memory vector processors: all vector operations are memory to memory
• Vector-register processors: all vector operations are between vector registers (except load and store)
  – Vector equivalent of load-store architectures
  – Includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC
  – We assume vector-register machines for the rest of these lectures

Page 100: Advanced Topic: High Performance Processors

Savio Chau

Components of Vector Processor

• Vector registers: fixed-length banks, each holding a single vector
  – Has at least 2 read and 1 write ports
  – Typically 8-32 vector registers, each holding 64-128 64-bit elements
• Vector functional units (FUs): fully pipelined, can start a new operation every clock
  – Typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; may have multiples of the same unit
• Vector load-store units (LSUs): fully pipelined units to load or store a vector; may have multiple LSUs
• Scalar registers: single elements for FP scalars or addresses
• Crossbar to connect the FUs, LSUs, and registers

Page 101: Advanced Topic: High Performance Processors

Savio Chau

Memory operations

• Load/store operations move groups of data between registers and memory
• Three types of addressing (a sketch follows this list):
  – Unit stride
    • Fastest
  – Non-unit (constant) stride
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
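A sketch of the three addressing modes, using a flat Python list as "memory"; the function names and sizes are assumptions made for the example.

    memory = list(range(100))

    def vload_unit(base, vlen):                 # unit stride: M[base], M[base+1], ...
        return [memory[base + i] for i in range(vlen)]

    def vload_strided(base, stride, vlen):      # constant stride: M[base + i*stride]
        return [memory[base + i * stride] for i in range(vlen)]

    def vload_indexed(base, index_vector):      # gather: M[base + index[i]]
        return [memory[base + idx] for idx in index_vector]

    print(vload_unit(10, 4))            # [10, 11, 12, 13]
    print(vload_strided(10, 5, 4))      # [10, 15, 20, 25]  (e.g., a matrix column)
    print(vload_indexed(0, [3, 50, 7])) # [3, 50, 7]        (sparse data, gather-scatter)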

Page 102: Advanced Topic: High Performance Processors

Savio Chau

Example of Vector Instruction (Y = a * X + Y)

Scalar (DLX) code:

        LD    F0,a          ;load scalar a
        ADDI  R4,Rx,#512    ;last address to load
loop:   LD    F2,0(Rx)      ;load X(i)
        MULTD F2,F0,F2      ;a * X(i)
        LD    F4,0(Ry)      ;load Y(i)
        ADDD  F4,F2,F4      ;a * X(i) + Y(i)
        SD    F4,0(Ry)      ;store into Y(i)
        ADDI  Rx,Rx,#8      ;increment index to X
        ADDI  Ry,Ry,#8      ;increment index to Y
        SUB   R20,R4,Rx     ;compute bound
        BNZ   R20,loop      ;check if done

Vector code:

        LD    F0,a          ;load scalar a
        LV    V1,Rx         ;load vector X
        MULTS V2,F0,V1      ;vector-scalar multiply
        LV    V3,Ry         ;load vector Y
        ADDV  V4,V2,V3      ;add
        SV    Ry,V4         ;store the result

Assuming vectors X, Y are length 64

Scalar vs. Vector

578 (2 + 9*64) vs. 321 (1 + 5*64) operations (1.8X)

578 (2 + 9*64) vs. 6 instructions (96X)

64-operation vectors + no loop overhead

also 64X fewer pipeline hazards
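For reference, a plain C version of the same DAXPY kernel (a minimal sketch; the scalar assembly above is one compilation of this loop, while the vector code replaces the whole loop with six instructions when n = 64):

    void daxpy(int n, double a, const double *x, double *y) {
        /* Y(i) = a * X(i) + Y(i); the vector version does all n iterations
           with one load, multiply, load, add, store vector sequence. */
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }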

Page 103: Advanced Topic: High Performance Processors

Savio Chau

Vector Surprise

• Use vectors for inner loop parallelism (no surprise)
  – One dimension of array: A[0,0], A[0,1], A[0,2], ...
  – think of machine as, say, 32 vector regs each with 64 elements
  – 1 instruction updates 64 elements of 1 vector register

• and for outer loop parallelism!
  – 1 element from each column: A[0,0], A[1,0], A[2,0], ...
  – think of machine as 64 "virtual processors" (VPs), each with 32 scalar registers! (~ multithreaded processor)
  – 1 instruction updates 1 scalar register in 64 VPs

• Hardware identical, just 2 compiler perspectives

Page 104: Advanced Topic: High Performance Processors

Savio Chau

Virtual Processor Vector Model

• Vector operations are SIMD (single instruction, multiple data) operations

• Each element is computed by a virtual processor (VP)
• Number of VPs given by vector length
  – vector control register

Page 105: Advanced Topic: High Performance Processors

Savio Chau

Vector Architectural State

[Figure: vector architectural state]

  Virtual processors VP0 .. VP$vlr-1 (one per element; $vlr = vector length register)
  General-purpose vector registers vr0-vr31: $vdw bits per element
  Flag registers vf0-vf31 (32): 1 bit per element
  Control registers vcr0-vcr31: 32 bits each

Page 106: Advanced Topic: High Performance Processors

Savio Chau

Vector Implementation

• Vector register file
  – Each register is an array of elements
  – Size of each register determines maximum vector length
  – Vector length register determines vector length for a particular operation

• Multiple parallel execution units = "lanes" (sometimes called "pipelines" or "pipes")

Page 107: Advanced Topic: High Performance Processors

Savio Chau

Vector Terminology: 4 lanes, 2 vector functional units


Page 108: Advanced Topic: High Performance Processors

Savio Chau

Vector Execution Time

• Time = f(vector length, data dependencies, structural hazards)
• Initiation rate: rate at which an FU consumes vector elements (= number of lanes; usually 1 or 2 on Cray T-90)
• Convoy: set of vector instructions that can begin execution in the same clock (no structural or data hazards)
• Chime: approx. time for a vector operation
• m convoys take m chimes; if each vector length is n, then they take approx. m x n clock cycles (ignores overhead; a good approximation for long vectors)

Example: 4 convoys, 1 lane, VL = 64 => 4 x 64 = 256 clocks (or 4 clocks per result)

1: LV V1,Rx ;load vector X

2: MULV V2,F0,V1 ;vector-scalar mult.

LV V3,Ry ;load vector Y

3: ADDV V4,V2,V3 ;add

4: SV Ry,V4 ;store the result

Page 109: Advanced Topic: High Performance Processors

Savio Chau

Example of Vector Instruction Start-up Time

• Start-up time: pipeline latency time (depth of the FU pipeline); another source of overhead

• Start-up penalties (from CRAY-1):
  – Vector load/store  12
  – Vector multiply     7
  – Vector add          6

Assume convoys don't overlap; vector length = n:

  Convoy         Start      1st result    Last result
  1. LV          0          12            11+n   (= 12 + n - 1)
  2. MULV, LV    12+n       12+n+12       23+2n  (waits for load start-up)
  3. ADDV        24+2n      24+2n+6       29+3n  (waits for convoy 2)
  4. SV          30+3n      30+3n+12      41+4n  (waits for convoy 3)
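A small C sketch that reproduces the table above (assuming non-overlapping convoys, one lane, and the CRAY-1 start-up numbers; the dominant start-up latency is used for the convoy containing both MULV and LV):

    #include <stdio.h>

    int main(void) {
        int n = 64;                              /* vector length */
        int startup[4] = {12, 12, 6, 12};        /* LV; MULV+LV (load dominates); ADDV; SV */
        int start = 0;
        for (int c = 0; c < 4; c++) {
            int first = start + startup[c];      /* first result appears after the pipeline fills */
            int last  = first + n - 1;           /* then one result per clock */
            printf("convoy %d: start %d, first %d, last %d\n", c + 1, start, first, last);
            start = last + 1;                    /* next convoy begins after this one finishes */
        }
        return 0;
    }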

Page 110: Advanced Topic: High Performance Processors

Savio Chau

Why startup time for each vector instruction?

• Why not overlap startup time of back-to-back vector instructions?

• Cray machines built from many ECL chips operating at high clock rates; hard to do?

• Berkeley vector design (“T0”) didn’t know it wasn’t supposed to do overlap, so no startup times for functional units (except load)

Page 111: Advanced Topic: High Performance Processors

Savio Chau

Vector Load/Store Units & Memories

• Start-up overheads are usually longer for LSUs
• Memory system must sustain (# lanes x word) / clock cycle
• Many vector processors use banks (vs. simple interleaving):
  1) support multiple loads/stores per cycle => multiple banks & ability to address banks independently
  2) support non-sequential accesses (see soon)
• Note: number of memory banks > memory latency, to avoid stalls
  – m banks => m words per memory latency of l clocks
  – if m < l, then there is a gap in the memory pipeline:

      clock:  0 ... l    l+1   l+2  ...  l+m-1   l+m ... 2l
      word:   -- ... 0    1     2   ...   m-1     --  ...  m

  – may have 1024 banks in SRAM

Page 112: Advanced Topic: High Performance Processors

Savio Chau

Vector Length

• What to do when vector length is not exactly 64?
• Vector-length register (VLR) controls the length of any vector operation, including a vector load or store (cannot be > the length of the vector registers)

      do 10 i = 1, n
 10   Y(i) = a * X(i) + Y(i)

• Don't know n until runtime!
• What if n > Max. Vector Length (MVL)?

Page 113: Advanced Topic: High Performance Processors

Savio Chau

Strip Mining

• Suppose vector length > Max. Vector Length (MVL)?
• Strip mining: generation of code such that each vector operation is done for a size <= MVL
• 1st loop iteration does the short piece (n mod MVL); the rest use VL = MVL

      low = 1
      VL = (n mod MVL)             /* find the odd-size piece */
      do 1 j = 0, (n / MVL)        /* outer loop */
        do 10 i = low, low+VL-1    /* runs for length VL */
          Y(i) = a*X(i) + Y(i)     /* main operation */
 10     continue
        low = low + VL             /* start of next vector */
        VL = MVL                   /* reset the length to max */
  1   continue
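The same strip-mining structure in C (a minimal sketch; MVL stands for whatever maximum vector length the hardware provides):

    #define MVL 64   /* assumed maximum vector length */

    void daxpy_stripmined(int n, double a, const double *x, double *y) {
        int low = 0;
        int vl = n % MVL;                    /* odd-size first strip (may be 0) */
        for (int j = 0; j <= n / MVL; j++) {
            /* On a vector machine this inner loop is one set of vector
               instructions executed with the vector length register set to vl. */
            for (int i = low; i < low + vl; i++)
                y[i] = a * x[i] + y[i];
            low += vl;
            vl = MVL;                        /* remaining strips are full length */
        }
    }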

Page 114: Advanced Topic: High Performance Processors

Savio Chau

Common Vector Metrics

• R∞: MFLOPS rate on an infinite-length vector
  – vector "speed of light"
  – Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
  – (Rn is the MFLOPS rate for a vector of length n)

• N1/2: the vector length needed to reach one-half of R∞
  – a good measure of the impact of start-up

• Nv: the vector length needed to make vector mode faster than scalar mode
  – measures both start-up and the speed of scalars relative to vectors, and the quality of the connection of the scalar unit to the vector unit

Page 115: Advanced Topic: High Performance Processors

Savio Chau

Vector Stride

• Suppose adjacent elements not sequential in memory

      do 10 i = 1, 100
        do 10 j = 1, 100
          A(i,j) = 0.0
          do 10 k = 1, 100
 10         A(i,j) = A(i,j) + B(i,k)*C(k,j)

• Either the B or the C accesses are not adjacent (800 bytes between elements)

• stride: distance separating elements that are to be merged into a single vector (caches do unit stride) => LVWS (load vector with stride) instruction

• Strides => can cause bank conflicts (e.g., stride = 32 and 16 banks)

• Think of address per vector element
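A C view of where the stride comes from (a minimal sketch; Fortran stores arrays column-major, so walking across a row of a 100x100 double array jumps 100 elements = 800 bytes per step):

    /* Column-major layout, as in Fortran: element (i,j) lives at b[(j-1)*100 + (i-1)]. */
    double b_elem(const double *b, int i, int j) {
        return b[(j - 1) * 100 + (i - 1)];
    }

    /* Loading row i of B as a vector touches b_elem(b, i, 1), b_elem(b, i, 2), ...,
       i.e. addresses 800 bytes apart: a stride of 100 elements, handled by an
       LVWS-style load-vector-with-stride instruction on a vector machine. */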

Page 116: Advanced Topic: High Performance Processors

Savio Chau

Compiler Vectorization on Cray XMP

  Benchmark   %FP    %FP in vector
  ADM         23%    68%
  DYFESM      26%    95%
  FLO52       41%    100%
  MDG         28%    27%
  MG3D        31%    86%
  OCEAN       28%    58%
  QCD         14%    1%
  SPICE       16%    7%  (1% overall)
  TRACK        9%    23%
  TRFD        22%    10%

Page 117: Advanced Topic: High Performance Processors

Savio Chau

Vector Opt #1: Chaining

• Suppose:

MULV V1,V2,V3

        ADDV V4,V1,V5   ; separate convoy?

• Chaining: treat the vector register (V1) not as a single entity but as a group of individual registers; then pipeline forwarding can work on individual elements of a vector
• Flexible chaining: allow a vector to chain to any other active vector operation => more read/write ports

• As long as enough HW, increases convoy size

Page 118: Advanced Topic: High Performance Processors

Savio Chau

Example Execution of Vector Code

[Figure: vector memory pipeline, vector multiply pipeline, and vector adder pipeline executing chained vector code alongside the scalar unit; 8 lanes, vector length 32, with chaining]

Page 119: Advanced Topic: High Performance Processors

Savio Chau

Vector Opt #2: Conditional Execution

• Suppose:

      do 100 i = 1, 64
        if (A(i) .ne. 0) then
          A(i) = A(i) - B(i)
        endif
 100  continue

• Vector-mask control takes a Boolean vector: when the vector-mask register is loaded from a vector test, vector instructions operate only on the vector elements whose corresponding entries in the vector-mask register are 1.
• Still requires a clock per element even if the result is not stored; if the operation is still performed, what about divide by 0?
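Conceptually, masked execution works like this (a minimal C sketch of the semantics, not a real vector ISA; vector length assumed <= 64):

    /* Semantics of a masked vector subtract: every lane runs, but only
       lanes whose mask bit is 1 write their result back. */
    void masked_sub(double *a, const double *b, int vl) {
        int mask[64];
        for (int i = 0; i < vl; i++)
            mask[i] = (a[i] != 0.0);        /* vector test -> vector-mask register */
        for (int i = 0; i < vl; i++)
            if (mask[i])
                a[i] = a[i] - b[i];         /* element updated only where mask is set */
    }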

Page 120: Advanced Topic: High Performance Processors

Savio Chau

Vector Opt #3: Sparse Matrices

• Suppose:

      do 100 i = 1, n
 100    A(K(i)) = A(K(i)) + C(M(i))

• Gather (LVI) operation takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets in the index vector => a nonsparse vector in a vector register

• After these elements are operated on in dense form, the sparse vector can be stored back in expanded form by a scatter store (SVI), using the same index vector

• Can't be done automatically by the compiler, since it can't know that the K(i) elements are distinct (i.e., no dependencies); use a compiler directive

• Use CVI to create index 0, 1xm, 2xm, ..., 63xm
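Gather/scatter semantics in C (a minimal sketch; the index arrays K and M play the role of the LVI/SVI index registers, and n <= 64 is assumed):

    /* A(K(i)) = A(K(i)) + C(M(i)) expressed with explicit gather and scatter. */
    void sparse_update(double *a, const double *c, const int *k, const int *m, int n) {
        double va[64], vc[64];               /* dense "vector registers" */
        for (int i = 0; i < n; i++) {        /* gather (LVI): indexed loads */
            va[i] = a[k[i]];
            vc[i] = c[m[i]];
        }
        for (int i = 0; i < n; i++)          /* dense vector add */
            va[i] += vc[i];
        for (int i = 0; i < n; i++)          /* scatter (SVI): indexed stores */
            a[k[i]] = va[i];
    }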

Page 121: Advanced Topic: High Performance Processors

Savio Chau

Sparse Matrix Example

• Cache (1993) vs. Vector (1988)

                       IBM RS6000     Cray YMP
  Clock                72 MHz         167 MHz
  Cache                256 KB         0.25 KB
  Linpack              140 MFLOPS     160 MFLOPS (1.1x)
  Sparse Matrix        17 MFLOPS      125 MFLOPS (7.3x)
  (Cholesky blocked)

• Cache: 1 address per cache block (32B to 64B)
• Vector: 1 address per element (4B)

Page 122: Advanced Topic: High Performance Processors

Savio Chau

Applications

Limited to scientific computing?

• Multimedia processing (compression, graphics, audio synthesis, image processing)
• Standard benchmark kernels (matrix multiply, FFT, convolution, sort)
• Lossy compression (JPEG, MPEG video and audio)
• Lossless compression (zero removal, RLE, differencing, LZW)
• Cryptography (RSA, DES/IDEA, SHA/MD5)
• Speech and handwriting recognition
• Operating systems/networking (memcpy, memset, parity, checksum)
• Databases (hash/join, data mining, image/video serving)
• Language run-time support (stdlib, garbage collection)
• ... even SPECint95

Page 123: Advanced Topic: High Performance Processors

Savio Chau

Vector for Multimedia?

• Intel MMX: 57 new 80x86 instructions (1st since 386)
  – similar to Intel i860, Motorola 88110, HP PA-7100LC, UltraSPARC

• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, packed in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)

• short vector: load, add, store 8 8-bit operands

• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...

– use in drivers or added to library routines; no compiler


Page 124: Advanced Topic: High Performance Processors

Savio Chau

MMX Instructions

• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optional signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to all 0s (false) or all 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large
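Saturating packed arithmetic in C (a minimal sketch of what an 8-bit unsigned saturating add does per lane; an MMX-style instruction performs eight of these at once):

    #include <stdint.h>

    /* Unsigned 8-bit saturating add: clamp to 255 instead of wrapping around. */
    uint8_t add_u8_sat(uint8_t a, uint8_t b) {
        unsigned sum = (unsigned)a + (unsigned)b;
        return (uint8_t)(sum > 255 ? 255 : sum);
    }

    /* A packed add applies this to all 8 bytes of a 64-bit register in parallel. */
    void padd_u8_sat(uint8_t dst[8], const uint8_t src[8]) {
        for (int i = 0; i < 8; i++)
            dst[i] = add_u8_sat(dst[i], src[i]);
    }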

Page 125: Advanced Topic: High Performance Processors

Savio Chau

Vectors and Variable Data Width

• Programmer thinks in terms of vectors of data of some width (8, 16, 32, or 64 bits)

• Good for multimedia; More elegant than MMX-style extensions

• Don’t have to worry about how data stored in hardware

– No need for explicit pack/unpack operations

• Just think of more virtual processors operating on narrow data

• Expand Maximum Vector Length with decreasing data width: 64 x 64bit, 128 x 32 bit, 256 x 16 bit, 512 x 8 bit

Page 126: Advanced Topic: High Performance Processors

Savio Chau

Media Processing: Vectorizable? Vector Lengths?

  Kernel                                Vector length
  Matrix transpose/multiply             # vertices at once
  DCT (video, communication)            image width
  FFT (audio)                           256-1024
  Motion estimation (video)             image width, image width/16
  Gamma correction (video)              image width
  Haar transform (media mining)         image width
  Median filter (image processing)      image width
  Separable convolution (image proc.)   image width

(from Pradeep Dubey - IBM, http://www.research.ibm.com/people/p/pradeep/tutor.html)

Page 127: Advanced Topic: High Performance Processors

Savio Chau

Vector Pitfalls

• Pitfall: Concentrating on peak performance and ignoring start-up overhead: NV (length faster than scalar) > 100!

• Pitfall: Increasing vector performance, without comparable increases in scalar performance (Amdahl's Law)

– failure of Cray competitor from his former company

• Pitfall: Good processor vector performance without providing good memory bandwidth

– MMX?

Page 128: Advanced Topic: High Performance Processors

Savio Chau

Vector Advantages

• Easy to get high performance; N operations:
  – are independent

– use same functional unit

– access disjoint registers

– access registers in same order as previous instructions

– access contiguous memory words or known pattern

– can exploit large memory bandwidth

– hide memory latency (and any other latency)

• Scalable (get higher performance as more HW resources become available)
• Compact: describe N operations with 1 short instruction (vs. VLIW)
• Predictable (real-time) performance vs. statistical performance (cache)
• Multimedia ready: choose N x 64b, 2N x 32b, 4N x 16b, 8N x 8b
• Mature, developed compiler technology
• Vector disadvantage: out of fashion

Page 129: Advanced Topic: High Performance Processors

Savio Chau

Vector Summary

• Alternate model accommodates long memory latency; doesn't rely on caches as do out-of-order, superscalar/VLIW designs

• Much easier for hardware: more powerful instructions, more predictable memory accesses, fewer hazards, fewer branches, fewer mispredicted branches, ...

• What % of computation is vectorizable?
• Is vector a good match to new apps such as multimedia and DSP?

Page 130: Advanced Topic: High Performance Processors

Savio Chau

Digital Signal Processor (DSP) Introduction

• Digital Signal Processing: application of mathematical operations to digitally represented signals

• Signals represented digitally as sequences of samples
• Digital signals obtained from physical signals via transducers (e.g., microphones) and analog-to-digital converters (ADCs)

• Digital signals converted back to physical signals via digital-to-analog converters (DAC)

• Digital Signal Processor (DSP): electronic system that processes digital signals

Page 131: Advanced Topic: High Performance Processors

Savio Chau

Common DSP algorithms and applications

• Applications
  – Instrumentation and measurement
  – Communications

– Audio and video processing

– Graphics, image enhancement, 3-D rendering

– Navigation, radar, GPS

– Control - robotics, machine vision, guidance

• Algorithms
  – Frequency-domain filtering - FIR and IIR

  – Frequency-time transformations - FFT

– Correlation

Page 132: Advanced Topic: High Performance Processors

Savio Chau

What Do DSPs Need to Do Well?

• Most DSP tasks require:
  – Repetitive numeric computations
  – Attention to numeric fidelity
  – High memory bandwidth, mostly via array accesses
  – Real-time processing

• DSPs must perform these tasks efficiently while minimizing:
  – Cost
  – Power
  – Memory use
  – Development time

Page 133: Advanced Topic: High Performance Processors

Savio Chau

DSP vs. General Purpose MPU

• DSPs tend to run 1 program, not many programs.

– Hence OSes are much simpler, there is no virtual memory or protection, ...

• DSPs sometimes run hard real-time apps
  – You must account for anything that could happen in a time slot
  – All possible interrupts or exceptions must be accounted for, and their collective time subtracted from the time interval
  – Therefore, exceptions are BAD!

• DSPs have an infinite continuous data stream

Page 134: Advanced Topic: High Performance Processors

Savio Chau

First Generation DSP (1982): Texas Instruments TMS32010

• 16-bit fixed-point
• "Harvard architecture"
  – separate instruction and data memories
• Accumulator
• Specialized instruction set
  – Load and Accumulate
• 390 ns Multiply-Accumulate (MAC) time; 228 ns today

[Figure: TMS32010 block diagram - processor with separate instruction memory and data memory; datapath: memory -> T-Register -> Multiplier -> P-Register -> ALU -> Accumulator]

Page 135: Advanced Topic: High Performance Processors

Savio Chau

Features Common to Most DSP Processors

• Data path configured for DSP
• Specialized instruction set
• Multiple memory banks and buses
• Specialized addressing modes
• Specialized execution control
• Specialized peripherals for DSP

Page 136: Advanced Topic: High Performance Processors

Savio Chau

DSP Data Path: Arithmetic

• DSPs deal with numbers representing the real world => want "reals"/fractions
• DSPs deal with numbers used as addresses => want integers
• Support "fixed point" as well as integers

  Fraction format:  S . x x ... x   (radix point right after the sign bit)   -1 <= x < 1
  Integer format:   S x x ... x .   (radix point at the far right)           -2^(N-1) <= x < 2^(N-1)
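A C sketch of Q15 fixed-point multiply (a 16-bit fraction format with the radix point after the sign bit; the 16-bit word size is an assumption for illustration, since the slide does not fix one):

    #include <stdint.h>

    /* Q15: value = raw / 2^15, so raw in [-32768, 32767] represents [-1, 1). */
    typedef int16_t q15_t;

    q15_t q15_mul(q15_t a, q15_t b) {
        int32_t prod = (int32_t)a * (int32_t)b;   /* 16b x 16b -> 32b product (Q30) */
        return (q15_t)(prod >> 15);               /* rescale back to Q15 (truncating; */
    }                                             /* the -1 * -1 corner case would need saturation) */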

Page 137: Advanced Topic: High Performance Processors

Savio Chau

DSP Data Path: Precision

• Word size affects precision of fixed-point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words
• Floating-point DSPs cost 2X-4X vs. fixed point, and are slower than fixed point
• DSP programmers will scale values inside code
  – SW libraries
  – Separate explicit exponent

• "Blocked floating point": single exponent for a group of fractions

• Floating-point support simplifies development

Page 138: Advanced Topic: High Performance Processors

Savio Chau

DSP Data Path: Overflow?

• DSPs are descended from analog systems: what should happen to the output when you "peg" an input? (e.g., turn up the volume control knob on a stereo)
  – Modulo arithmetic???

• Set to the most positive (2^(N-1) - 1) or most negative value (-2^(N-1)): "saturation"

• Many algorithms were developed in this model

Page 139: Advanced Topic: High Performance Processors

Savio Chau

DSP Data Path: Multiplier

• Specialized hardware performs all key arithmetic operations in 1 cycle

• 50% of instructions can involve the multiplier => single-cycle latency multiplier
• Need to perform multiply-accumulate (MAC)
• n-bit multiplier => 2n-bit product

Page 140: Advanced Topic: High Performance Processors

Savio Chau

DSP Data Path: Accumulator

• Don't want overflow or have to scale the accumulator
• Option 1: accumulator wider than the product: "guard bits"
  – Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round the product before the adder

[Figure: two accumulator datapaths - Option 1: multiplier -> ALU -> accumulator with guard bits (G); Option 2: multiplier -> shift -> ALU -> accumulator]
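A C sketch of a guard-bit MAC (Option 1), assuming 16-bit data and a 40-bit accumulator with 8 guard bits; this is a common arrangement, not the Motorola 24-bit case quoted above:

    #include <stdint.h>

    /* Multiply-accumulate with guard bits: 16b x 16b -> 32b product, summed
       into a 40-bit accumulator held in an int64_t. The 8 guard bits let at
       least 256 worst-case products accumulate before overflow. */
    int64_t mac(const int16_t *x, const int16_t *h, int n) {
        int64_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)x[i] * (int32_t)h[i];
        return acc;   /* caller rounds/saturates this back to 16 bits for storage */
    }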

Page 141: Advanced Topic: High Performance Processors

Savio Chau

DSP Data Path: Rounding

• Even with guard bits, will need to round when store accumulator into memory

• 3 standard DSP options
• Truncation: chop results
  => biases results up
• Round to nearest: < 1/2 round down, >= 1/2 round up (more positive)
  => smaller bias
• Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make the LSB a zero (+1 if 1, +0 if 0)
  => no bias; IEEE 754 calls this "round to nearest even"
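A C sketch of the three options for dropping the low 15 bits of a Q30 product back to Q15 (an illustrative choice of widths, matching the earlier Q15 example; assumes arithmetic right shift of negative values, as on typical compilers):

    #include <stdint.h>

    /* Truncation: just chop the low 15 bits. */
    int16_t rnd_trunc(int32_t q30)      { return (int16_t)(q30 >> 15); }

    /* Round to nearest: add half an LSB before chopping. */
    int16_t rnd_nearest(int32_t q30)    { return (int16_t)((q30 + (1 << 14)) >> 15); }

    /* Convergent (round to nearest even): ties go to the value whose LSB is 0. */
    int16_t rnd_convergent(int32_t q30) {
        int32_t frac = q30 & 0x7FFF;              /* the 15 bits being dropped */
        int32_t res  = q30 >> 15;
        if (frac > (1 << 14))                     /* above one half: round up */
            res += 1;
        else if (frac == (1 << 14) && (res & 1))  /* exactly one half: round to even */
            res += 1;
        return (int16_t)res;
    }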

Page 142: Advanced Topic: High Performance Processors

Savio Chau

DSP Memory

• An FIR tap implies multiple memory accesses
• DSPs want multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand
  – Instruction repeat buffer: do 1 instruction 256 times
  – Often disables interrupts, thereby increasing interrupt response time

• Some recent DSPs have instruction caches
  – Even then, they may allow the programmer to "lock in" instructions into the cache
  – Option to turn the cache into fast program memory

• No DSPs have data caches
• May have multiple data memories

Page 143: Advanced Topic: High Performance Processors

Savio Chau

DSP Addressing

• Have standard addressing modes: immediate, displacement, register indirect

• Want to keep the MAC datapath busy
• Assumption: any extra instructions imply clock cycles of overhead in the inner loop
  => complex addressing is good
  => don't use the datapath to calculate fancy addresses

• Autoincrement/autodecrement register indirect
  – lw r1, 0(r2)+  =>  r1 <- M[r2]; r2 <- r2 + 1
  – Option to do the increment before addressing, positive or negative

Page 144: Advanced Topic: High Performance Processors

Savio Chau

DSP Addressing: Buffers

• DSPs deal with continuous I/O
• Often interact with an I/O buffer (delay lines)
• To save memory, the buffer is often organized as a circular buffer
• What can be done to avoid the overhead of address-checking instructions for the circular buffer?
• Option 1: keep a start register and an end register per address register, for use with autoincrement addressing; reset to the start when the end of the buffer is reached
• Option 2: keep a buffer-length register, assuming the buffer starts on an aligned address; reset to the start when the end is reached

• Every DSP has “modulo” or “circular” addressing
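What the hardware's modulo addressing saves you from writing explicitly, shown in C (a minimal sketch of a software circular delay line; BUF_LEN is an assumed length):

    #include <stdint.h>

    #define BUF_LEN 64                 /* assumed delay-line length */

    typedef struct {
        int16_t data[BUF_LEN];
        int     head;                  /* index of the newest sample */
    } circ_buf_t;

    void circ_push(circ_buf_t *cb, int16_t sample) {
        cb->head++;
        if (cb->head == BUF_LEN)       /* this wrap check is what modulo */
            cb->head = 0;              /* addressing hardware does for free */
        cb->data[cb->head] = sample;
    }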

Page 145: Advanced Topic: High Performance Processors

Savio Chau

DSP Addressing: FFT

• FFTs start or end with data in weird butterfly (bit-reversed) order:

  0 (000) => 0 (000)

1 (001) => 4 (100)

2 (010) => 2 (010)

3 (011) => 6 (110)

4 (100) => 1 (001)

5 (101) => 5 (101)

6 (110) => 3 (011)

7 (111) => 7 (111)

• What can be done to avoid the overhead of address-checking instructions for the FFT?

• Have an optional "bit reverse" addressing mode for use with autoincrement addressing

• Many DSPs have “bit reverse” addressing for radix-2 FFT
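In software, computing the bit-reversed index costs a loop per element; that is exactly what the hardware addressing mode removes (a minimal C sketch for a radix-2 FFT of size 2^bits):

    #include <stdio.h>

    /* Reverse the low `bits` bits of index i (bits = 3 reproduces the table above). */
    unsigned bit_reverse(unsigned i, unsigned bits) {
        unsigned r = 0;
        for (unsigned b = 0; b < bits; b++) {
            r = (r << 1) | (i & 1);    /* shift the low bit of i into r */
            i >>= 1;
        }
        return r;
    }

    int main(void) {
        for (unsigned i = 0; i < 8; i++)
            printf("%u => %u\n", i, bit_reverse(i, 3));
        return 0;
    }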

Page 146: Advanced Topic: High Performance Processors

Savio Chau

DSP vs. General Purpose MPU

• The "MIPS/MFLOPS" of DSPs is the speed of the multiply-accumulate (MAC).
  – DSPs are judged by whether they can keep the multipliers busy 100% of the time.

• The "SPEC" of DSPs is 4 algorithms:
  – Infinite Impulse Response (IIR) filters
  – Finite Impulse Response (FIR) filters
  – FFT, and
  – convolvers

• In DSPs, algorithms are king!
  – Binary compatibility is not an issue

• Software is not (yet) king in DSPs.
  – People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.

Page 147: Advanced Topic: High Performance Processors

Savio Chau

DSP Instructions

• May specify multiple operations in a single instruction
• Must support multiply-accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead
  – Loop an instruction or a sequence
  – A 0 value in the register usually means loop the maximum number of times
  – Must be careful when calculating the loop count that 0 does not mean 0

• May have saturating shift-left arithmetic
• May have conditional execution to reduce branches

Page 148: Advanced Topic: High Performance Processors

Savio Chau

Summary: How are DSPs different?

• Single cycle multiply accumulate (multiple busses and array multipliers)

• Complex instructions for standard DSP functions (IIR and FIR filters, convolvers)

• Specialized memory addressing
  – Modulo arithmetic for circular buffers (delay lines)
  – Bit reversal (FFT)

• Zero-overhead loops and repeat instructions
• I/O support - serial and parallel ports

Page 149: Advanced Topic: High Performance Processors

Savio Chau

Summary: Unique Features in DSP architectures

• Continuous I/O stream, real-time requirements
• Multiple memory accesses
• Autoincrement/autodecrement addressing
• Datapath
  – Multiplier width
  – Wide accumulator
  – Guard bits / shifting / rounding
  – Saturation

• Weird things
  – Circular addressing
  – Bit-reverse addressing

• Special instructions
  – shift left and saturate (saturating arithmetic left-shift)