Low-Complexity Reorder Buffer Architecture*
ICS’02 1
Low-Complexity Reorder Buffer Architecture*
*supported in part by DARPA through the PAC-C program and NSF
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th 2002
ICS’02 2
Outline
ROB complexities
Motivation for the low-complexity ROB
Low-complexity ROB design
Results
Concluding remarks
ICS’02 3
What This Work is All About
Complex, richly-ported ROBs are common in modern superscalar datapaths
The number of ports grows further when results are held within ROB slots (example: Pentium III)
ROB complexity reduction is important for reducing power and improving performance
ROB dissipates a non-trivial fraction of the total chip power
ROB accesses stretch over several cycles
Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance
ICS’02 4
Pentium III-like Superscalar Datapath
[Figure: block diagram of the datapath. Pipeline stages: Fetch (F1, F2), Decode/Dispatch (D1, D2), Instruction Issue, EX. Blocks: IQ, function units FU1, FU2, …, FUm, LSQ, D-cache, ROB, ARF (Architectural Register File), and the result/status forwarding buses.]
ICS’02 5
ROB Port Requirements for a W-way CPU
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands
Decode/Dispatch: W write ports to set up entries
Commit: W read ports for instruction commitment
ICS’02 6
ROB Port Requirements for a W-way CPU
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands
Decode/Dispatch: 1 W-wide write port to set up entries
Commit: 1 W-wide read port for instruction commitment
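The port budget on the two slides above can be tallied with a few lines of Python. This is only an illustrative sketch: the function name `rob_ports` and its `wide_dispatch_commit` flag are made up for this example, and the count treats a W-wide port as a single port, ignoring its larger width.

```python
def rob_ports(width: int, wide_dispatch_commit: bool = False) -> dict:
    """Count ROB ports for a `width`-way machine, following the slide's breakdown.

    `wide_dispatch_commit` selects the variant that uses one W-wide port
    for entry setup and one for commitment (hypothetical flag name).
    """
    ports = {
        "writeback_write": width,    # W write ports to write results
        "source_read": 2 * width,    # 2W read ports for source operands
        "dispatch_write": 1 if wide_dispatch_commit else width,
        "commit_read": 1 if wide_dispatch_commit else width,
    }
    ports["total"] = sum(ports.values())
    return ports


four_way = rob_ports(4)
four_way_wide = rob_ports(4, wide_dispatch_commit=True)
print(four_way["total"], four_way_wide["total"])
print("source read ports:", four_way["source_read"])
```

In both variants the 2W source-operand read ports are the largest single group, which is why they are the target of the rest of the talk.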
ICS’02 7
Where are the Source Values Coming From?
[Figure: the datapath block diagram (Fetch F1/F2, Decode/Dispatch D1/D2, IQ, function units FU1 … FUm, EX, LSQ, D-cache, ROB, ARF, result/status forwarding buses), with three numbered paths (1, 2, 3) by which source values reach an instruction.]
ICS’02 8
Where are the Source Values Coming From ?
[Figure: bar chart of where source values come from, on a 0% to 100% scale]
Forwarding: 62%
ARF: 32%
ROB: 6%
96-entry ROB, 4-way processor, SPEC2K benchmarks
ICS’02 9
How Efficiently are the Ports Used ?
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands (6%)
Decode/Dispatch: W write ports to set up entries
Commit: W read ports for instruction commitment
ICS’02 10
Approaches to Reducing ROB Complexity
Reduce the number of read ports for reading out the source operand values
More radical (and better): Completely eliminate the read ports for reading source operand values!
ICS’02 11
Reducing the Number of Read Ports
[Figure: bar charts of performance drop (%) per SPEC2K benchmark with 1 read port and with 2 read ports. Integer: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating point: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average IPC drop: 3.5% (1 read port), 1.0% (2 read ports)
ICS’02 12
Problems with Retaining Fewer Source Read Ports on the ROB
Need arbitration for the small number of ports
Additional logic needed to block the instructions which could not get the port.
Need a switching network to route the operands to correct destinations
Multi-cycle access still remains in the critical path of Dispatch/Issue logic
ICS’02 13
Our Solution: Elimination of Read Ports
[Figure: the datapath block diagram with the three numbered source-value paths (1, 2, 3).]
ICS’02 15
Our Solution: Elimination of Read Ports
[Figure: the same datapath block diagram with path 2 removed; only paths 1 and 3 remain.]
ICS’02 16
Comparison of ROB Bitcells (0.18µ, TSMC)
Layout of a 32-ported SRAM bitcell
Layout of a 16-ported SRAM bitcell
Area Reduction – 71%
Shorter bit and wordlines
ICS’02 17
Our Solution: Elimination of Read Ports
[Figure: the datapath block diagram (IQ, function units FU1 … FUm, LSQ, D-cache, ROB, ARF, result/status forwarding buses).]
Area Reduction – 45%
ICS’02 18
Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation
Power is reduced because of:
shorter bitlines and wordlines
lower capacitive loading
fewer decoders
fewer drivers and sense amps
ICS’02 19
Completely Eliminating the Source Read Ports on the ROB
The Problem: Instructions that require a value stored in the ROB can no longer read it, so their issue will stall
Solution:
Forward the value to the waiting instruction at the time the value is committed: LATE FORWARDING
ICS’02 20
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath block diagram (IQ, function units FU1 … FUm, LSQ, D-cache, ROB, ARF), with the result/status forwarding buses highlighted.]
ICS’02 22
Optimizing Late Forwarding
PROBLEM: If Late Forwarding is done for every committed result, additional forwarding buses are needed to avoid degrading performance
SOLUTION: Selective Late Forwarding (SLF)
SLF requires an additional bit in the ROB
That bit is set by dispatched instructions that require Late Forwarding
No additional forwarding buses are needed, since SLF traffic is very small
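The SLF idea can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the names `RobEntry`, `mark_for_late_forwarding` and `commit`, the dictionary-based ROB and ARF, and the list standing in for the forwarding buses are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    dest_arch_reg: int
    value: Optional[int] = None  # filled in at writeback
    slf: bool = False            # the extra bit that SLF adds


def mark_for_late_forwarding(rob: Dict[int, RobEntry], slot: int) -> None:
    """Called at dispatch by an instruction that could not obtain the value."""
    rob[slot].slf = True


def commit(rob: Dict[int, RobEntry], slot: int, arf: Dict[int, int],
           forwarding_bus: List[Tuple[int, int]]) -> None:
    """Commit one entry; drive the forwarding bus only if the SLF bit is set."""
    entry = rob.pop(slot)
    arf[entry.dest_arch_reg] = entry.value
    if entry.slf:
        forwarding_bus.append((slot, entry.value))


rob = {12: RobEntry(dest_arch_reg=2, value=7),
       13: RobEntry(dest_arch_reg=4, value=9)}
arf: Dict[int, int] = {}
bus: List[Tuple[int, int]] = []

mark_for_late_forwarding(rob, 12)  # some dispatched consumer needs slot 12
commit(rob, 12, arf, bus)
commit(rob, 13, arf, bus)
print(bus)  # only slot 12's result used a forwarding bus
```

Both results reach the ARF, but only the entry whose bit was set occupies a forwarding bus at commit, which is why the extra bus traffic stays small.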
ICS’02 23
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath block diagram (IQ, function units FU1 … FUm, LSQ, D-cache, ROB, ARF), with the result/status forwarding buses highlighted.]
Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING
ICS’02 24
Performance Drop of Simplified ROB
[Figure: bar charts of performance drop (%) per SPEC2K benchmark (integer: bzip2 … vpr, Int Avg.; floating point: applu … wupwise, FP Avg.) for three configurations: no ROB read ports with SLF, 1 read port, 2 read ports. Two bars are labeled 37% and 17%.]
Average IPC drop: 9.6% (no ROB read ports with SLF), 3.5% (1 read port), 1.0% (2 read ports)
ICS’02 25
IPC Penalty: Source Value Not Accessible within the ROB
[Figure: timeline of the lifetime of a result value. Result generation and forwarding come first; the value then resides within the ROB until late forwarding/commitment, after which it resides within the ARF.]
ICS’02 26
Improving IPC with No Read Ports
Cache recently generated values in a set of RETENTION LATCHES (RL)
Retention Latches are SMALL and FAST
Only 8 to 16 latches needed in the set
Entire set has 1 or 2 read ports
ICS’02 27
Datapath with the Retention Latches
[Figure: the datapath block diagram (Fetch F1/F2, Decode/Dispatch D1/D2, IQ, function units FU1 … FUm, EX, LSQ, D-cache, ROB, ARF, result/status forwarding buses).]
ICS’02 28
Datapath with the Retention Latches
[Figure: the same datapath block diagram with a RETENTION LATCHES block added.]
ICS’02 29
The Structure of the Retention Latch Set
8 or 16 latches, each with three fields:
ROB slot address: L-ported CAM field (key = ROB_slot_id), searched with L ROB slot addresses (L = 1 or 2)
Status
Result value: up to L recently written results are read out (L = 1 or 2 works great)
W write ports for writing up to W results in parallel
ICS’02 30
Retention Latch Management Strategies
FIFO
8 entry RL: 42% hit rate
16 entry RL: 55% hit rate
LRU
8 entry RL: 56% hit rate
16 entry RL: 62% hit rate
Random Replacement
Worse performance than FIFO
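A small software model makes the difference between the FIFO and LRU policies concrete. This is a behavioral sketch under stated assumptions, not the hardware design: the class name `RetentionLatches` and its methods are hypothetical, an `OrderedDict` stands in for the CAM, and port limits (L read ports, W write ports) are not modeled.

```python
from collections import OrderedDict
from typing import Optional


class RetentionLatches:
    """Tiny cache of recently produced results, keyed by ROB slot id."""

    def __init__(self, size: int = 8, policy: str = "FIFO") -> None:
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self._latches: "OrderedDict[int, int]" = OrderedDict()  # oldest first

    def write(self, rob_slot: int, value: int) -> None:
        """Insert a newly produced result, evicting the oldest entry if full."""
        if rob_slot in self._latches:
            del self._latches[rob_slot]          # slot reused: replace stale entry
        elif len(self._latches) >= self.size:
            self._latches.popitem(last=False)    # evict the front of the order
        self._latches[rob_slot] = value

    def lookup(self, rob_slot: int) -> Optional[int]:
        """Return the cached value, or None on a miss."""
        if rob_slot not in self._latches:
            return None
        if self.policy == "LRU":
            self._latches.move_to_end(rob_slot)  # a read refreshes the entry
        return self._latches[rob_slot]

    def invalidate(self, rob_slot: int) -> None:
        self._latches.pop(rob_slot, None)

    def flush(self) -> None:
        self._latches.clear()


fifo = RetentionLatches(size=2, policy="FIFO")
lru = RetentionLatches(size=2, policy="LRU")
for rl in (fifo, lru):
    rl.write(1, 10)
    rl.write(2, 20)
    rl.lookup(1)      # touch slot 1
    rl.write(3, 30)   # forces one eviction
print(fifo.lookup(1), lru.lookup(1))
```

With FIFO the touched entry is still evicted because it was written first, while LRU keeps it and evicts the untouched one; this is the behavior behind LRU's higher hit rates on the slide.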
ICS’02 31
Hit Ratios to Retention Latches
[Figure: bar charts of hit ratio (%) per SPEC2K benchmark (integer: bzip2 … vpr, Int Avg.; floating point: applu … wupwise, FP Avg.) for four 2-ported configurations: FIFO 8, FIFO 16, LRU 8, LRU 16.]
Average hit ratio: 42% (FIFO 8), 55% (FIFO 16), 56% (LRU 8), 62% (LRU 16)
ICS’02 32
Accessing Retention Latch Entries
The ROB index is used as a unique key to search the Retention Latches for result values
Keys must remain unique even when we have:
Reuse of a ROB slot: not a problem for FIFO; for LRU, simply flush the RL entry at commit time
Branch mispredictions
ICS’02 33
Handling Branch Mispredictions
Selective RL Flushing: Retention latch entries that are in the mispredicted path are flushed
Uses branch tags
Complicated implementation
Complete RL Flushing: All retention latch entries are flushed
Very simple implementation
Performance drop is only 1.5% compared to selective flushing
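The two flushing policies can be contrasted with a short Python sketch. The tagging scheme here is a hypothetical one chosen for illustration (the slides only say that branch tags are used): each latch entry records the number of the youngest branch older than its producer, so every entry whose tag is at least the mispredicted branch's number lies on the wrong path.

```python
from typing import Dict, Tuple

# rob_slot -> (value, branch_tag); the tag encoding is an assumption
Latches = Dict[int, Tuple[int, int]]


def selective_flush(latches: Latches, mispredicted_branch: int) -> Latches:
    """Keep only entries produced before the mispredicted branch."""
    return {slot: (value, tag)
            for slot, (value, tag) in latches.items()
            if tag < mispredicted_branch}


def complete_flush(latches: Latches) -> Latches:
    """Drop everything; no tags need to be stored or compared."""
    return {}


latches: Latches = {10: (7, 0), 11: (8, 1), 12: (9, 2)}
after_selective = selective_flush(latches, mispredicted_branch=1)
after_complete = complete_flush(latches)
print(after_selective, after_complete)
```

Complete flushing also discards correct-path values, which then miss in the latches; the slide reports that this costs only 1.5% in performance compared to selective flushing.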
ICS’02 34
Misprediction Handling: Performance
[Figure: bar chart of IPC (0 to 3.5) per SPEC2K benchmark, with Int., FP and overall averages, comparing selective flushing and complete flushing.]
Average IPC drop: 1.5%
ICS’02 35
Scenario 1: Traditional Design
Instruction: ADD R1, R2, R3
Simplified IDB entry #1: instruction ADD, ROB index 5, Src1 reg. 2, Src2 reg. 3; the valid and value fields of both sources are initially unknown (?)
Rename Table lookup (columns: Arch., ROB#/Phys., ROB=0/ARF=1): arch. register 2 maps to ROB slot 12 (ROB=0); arch. register 3 maps to register 3 in the ARF (ARF=1)
Src1, result already in the ROB: ROB entry 12 has Phys.valid = 1 and Phys.value = 7, so Src1 valid = 1 and Src1 value = 7
Src1, result not yet in the ROB: ROB entry 12 has Phys.valid = 0, so Src1 valid = 0 and Src1 value stays unknown (?)
Src2: ARF entry 3 has Arch.value = 43, so Src2 valid = 1 and Src2 value = 43
ICS’02 43
Scenario 2: Simplified ROB with RLs
Instruction: ADD R1, R2, R3
Simplified IDB entry #1: instruction ADD, ROB index 5, Src1 reg. 2, Src2 reg. 3; the valid and value fields of both sources are initially unknown (?)
Rename Table lookup (columns: Arch., ROB#/Phys., ROB=0/ARF=1): arch. register 2 maps to ROB slot 12 (ROB=0); arch. register 3 maps to register 3 in the ARF (ARF=1)
Src1, Retention Latch hit: the Retention Latches hold Phys.value = 7 under key 12, so Src1 valid = 1 and Src1 value = 7
Src1, Retention Latch MISS: key 12 is not in the Retention Latches; the Phys.valid and Phys.value fields of ROB entry 12 are not read (X: Don’t Care); Src1 valid = 0, and the SLF bit of ROB entry 12 is set from 0 to 1
Src2: ARF entry 3 has Arch.value = 43, so Src2 valid = 1 and Src2 value = 43
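The source-operand lookup walked through in this scenario can be sketched as a small Python function. It is a simplified illustration, not the authors' logic: the function name `read_source`, the tuple-based rename table and the dictionaries for the ARF, the retention latches and the SLF bits are all assumptions, and the sketch sets the SLF bit on every latch miss, as the slide frames show.

```python
from typing import Dict, Optional, Tuple

RenameTable = Dict[int, Tuple[str, int]]  # arch reg -> ("ROB" | "ARF", index)


def read_source(arch_reg: int,
                rename_table: RenameTable,
                arf: Dict[int, int],
                retention_latches: Dict[int, int],
                slf_bits: Dict[int, bool]) -> Tuple[bool, Optional[int]]:
    """Return (valid, value) for one source operand at dispatch."""
    location, index = rename_table[arch_reg]
    if location == "ARF":
        return True, arf[index]                # committed value
    if index in retention_latches:
        return True, retention_latches[index]  # latch hit, keyed by ROB slot
    slf_bits[index] = True                     # miss: request late forwarding
    return False, None                         # the ROB itself is never read


# Values taken from the ADD R1, R2, R3 example on the slides.
rename_table: RenameTable = {2: ("ROB", 12), 3: ("ARF", 3)}
arf = {3: 43}

slf_hit: Dict[int, bool] = {}
src1_hit = read_source(2, rename_table, arf, {12: 7}, slf_hit)
src2 = read_source(3, rename_table, arf, {12: 7}, slf_hit)

slf_miss: Dict[int, bool] = {}
src1_miss = read_source(2, rename_table, arf, {}, slf_miss)
print(src1_hit, src2, src1_miss, slf_miss)
```

The function never touches the ROB's value field, which mirrors the point of the design: the ROB needs no read ports for source operands.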
ICS’02 52
Experimental Setup: the AccuPower (DATE’02)
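[Figure: AccuPower tool flow. Compiled SPEC benchmarks and datapath specs feed a microarchitectural simulator (rooted in SimpleScalar), which produces performance stats as well as transition counts and context information. VLSI layout data yield a SPICE deck, and SPICE produces measures of energy per transition. The energy/power estimator combines the transition counts with the SPICE energy measures to produce power/energy stats.]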
ICS’02 53
Configuration of the Simulated System
Machine width 4-way
Issue Queue 32 entries
Reorder Buffer 96 entries
Load/Store Queue 32 entries
Simulated the execution of SPEC2000 benchmarks
ICS’02 54
Assumed Timings
Timing of the baseline model (D1, D2, D3):
D1: Rename Table lookup for ROB index
D2, D3: source operand read from the ROB
Timing of the simplified ROB (D1, D2):
D1: Rename Table lookup for ROB index
D2: associative lookup of operand from retention latches using ROB index as a key (smaller delay: few latches)
ICS’02 55
Experimental Results: Effect on Performance
[Figure: bar charts of performance drop (%) per SPEC2K benchmark (integer: bzip2 … vpr, Int Avg.; floating point: applu … wupwise, FP Avg.) for four retention latch configurations. Negative values are performance gains.]
Avg. IPC drop: 0.1% (8 2-ported FIFO), -1.6% (8 2-ported LRU), -1.0% (16 2-ported FIFO), -2.3% (16 2-ported LRU)
ICS’02 56
Experimental Results: Effect on Performance
[Figure: bar charts of performance drop (%) per SPEC2K benchmark (integer: bzip2 … vpr, Int Avg.; floating point: applu … wupwise, FP Avg.) for four retention latch configurations.]
Avg. IPC drop: 3.3% (8 2-ported FIFO), 1.7% (8 2-ported LRU), 2.3% (16 2-ported FIFO), 1.0% (16 2-ported LRU)
ICS’02 57
Experimental Results: Effect on Power
[Figure: bar charts of power savings (%) per SPEC2K benchmark (integer: bzip2 … vpr, Int Avg.; floating point: applu … wupwise, FP Avg.) for five configurations.]
Avg. savings: 30% (no ROB ports), 23.4% (8 2-ported FIFO), 22.2% (8 2-ported LRU), 21% (16 2-ported FIFO), 20.2% (16 2-ported LRU)
ICS’02 58
Summary of Results
Significantly reduced ROB complexity and power dissipation
45% area reduction
20% to 30% power reduction across SPEC 2000 benchmarks
Actual IPC improvements:
1.6% to 2.3% gain across SPEC benchmarks
IPC gains come from the 1-cycle access to the RL (vs. the 2 cycles that would be needed for a ROB access)
ICS’02 59
Related Work
Value-Aging Buffer (Hu & Martonosi, PACS 2000)
Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA’02)
Multiple Register Banks (Cruz et al., ISCA’00 & Balasubramonian et al., MICRO’01)
See paper for discussions
ICS’02 60
Conclusions
Typical source operand location statistics can be successfully exploited to reduce ROB complexity
Significant reduction in ROB area and power – no ROB ports needed for reading source operands
IPC gains are possible because a small, low-ported set of Retention Latches supplies cached operand values in a single cycle