Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions
Aamer Jaleel and Bruce Jacob
Electrical and Computer Engineering, University of Maryland, College Park
{ajaleel, blj} @ eng.umd.edu
A. Jaleel and B. Jacob. “Using Virtual Load/Store Queues to Reduce the Negative Effects of Reordered Memory Instructions”
Paper Motivation
• Maximizing Application ILP:
– OoO performance depends on the size of the instruction window or reorder buffer (ROB)
– Improve ILP with larger ROB sizes
• Before This Paper:
– Many studies have shown large performance gains with large ROBs
– Most have discounted real effects in the memory subsystem
Paper Contributions
• Uncovering A Problem:
– Increasing OoO capability degrades memory system performance
• Increase in replay traps
• Increase in L1 cache misses
• The Reason:
– OoO scheduler reordering memory instructions
• The Solution:
– Restrict reordering of memory instructions
– Virtual Load/Store Queue (VLSQ)
Background – Replay Traps
• Hardware events to ensure correct execution order of memory instructions
• Types of Replay Traps
– Load-Store Replay Trap
– Wrong-Size Replay Trap
– Load-Load Replay Trap
– Load-Miss Load Replay Trap
[Figure: four replay trap examples, each a four-instruction sequence; the parenthesized numbers are as shown on the slide]
Load-Store Replay
1. LD BYTE A (1)
2. ST BYTE A (3)
3. LD BYTE A (2)
4. LD BYTE B (4)
Wrong-Size Replay
1. LD BYTE A (1)
2. ST BYTE A (2)
3. LD HALF A (3)
4. LD BYTE B (4)
Load-Miss Load Replay (diagram labeled P1, P2)
1. +LD BYTE A (1)
2. ST BYTE A (2)
3. LD BYTE A (3)
4. LD BYTE B (4)
Load-Load Replay (diagram labeled P1, P2)
1. LD BYTE A (4)
2. ST BYTE A (2)
3. LD BYTE A (1)
4. LD BYTE B (3)
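As a rough illustration of the load-store case, the sketch below flags a load that was issued before an older store to the same address. It assumes the parenthesized numbers in the Load-Store Replay example are issue order, which the slide does not state explicitly; the `MemOp` type and `load_store_violations` function are hypothetical names, not part of Sim-Alpha or the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    seq: int    # position in program order
    kind: str   # "LD" or "ST"
    addr: str   # symbolic address, e.g. "A"
    issue: int  # ASSUMPTION: parenthesized slide number read as issue order


def load_store_violations(ops):
    """Return (store_seq, load_seq) pairs where a load issued before
    an older store to the same address."""
    violations = []
    for ld in ops:
        if ld.kind != "LD":
            continue
        for st in ops:
            older_store = st.kind == "ST" and st.seq < ld.seq
            if older_store and st.addr == ld.addr and ld.issue < st.issue:
                violations.append((st.seq, ld.seq))
    return violations


# The Load-Store Replay example from the slide.
example = [
    MemOp(1, "LD", "A", 1),
    MemOp(2, "ST", "A", 3),
    MemOp(3, "LD", "A", 2),
    MemOp(4, "LD", "B", 4),
]
violations = load_store_violations(example)
```

Under that reading, instruction 3 (a load of A) issues ahead of instruction 2 (a store to A), which is the ordering a load-store replay trap exists to catch.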
Experimental Framework
• Simulator:
– Sim-Alpha
– 64K 2-Way IL1/DL1, 2MB 4-Way L2, 8 MSHRs / cache
– Branch predictor: 4K BTB, and 2K hybrid g-share/bimodal
– 1024-entry store-wait predictor
– Hardware data prefetcher: 2-Way 256-entry stride table and eight 8-entry stream buffers
– Detailed DDR2 DRAM model with queuing delays
• Benchmarks
– SPEC2000
The Problem w/↑ OoO Capability
• Replay Traps:
– Trap frequency increases by a factor of 5
– Trap overhead increases by 10-60%
• L1 Cache Misses:
– Number of cache misses increases by 15% (average)
– fma3d, mesa, wupwise, eon, vpr, twolf, swim (20% – 40%)
[Figure: Traps / 1000 Instructions, for ROB-80, ROB-128, ROB-256, ROB-512]
[Figure: % Increase in L1 Cache Misses (compared to ROB-80), for ROB-80, ROB-128, ROB-256, ROB-512]
Why The Problem?
• OoO execution reorders both ALU and memory instructions
• Replay traps and cache misses are problems associated with memory instructions
• Hypothesis:
– Reordering of ALU instructions poses little or no threat
BUT
– Reordering of memory instructions causes the problem
How many are issued in-order?
• 10 to 20% of memory instructions are issued in order with increased OoO capability
Need to reduce reordering of memory instructions
[Figure: % Memory Instructions vs. Distance From Being Issued In Program Order (axis from -W through 0 to W; regions marked Issued Late, In-order Issue, Issued Early), for ROB 80, ROB 128, ROB 256, ROB 512. Data labels on the plot: 55%, 10%, 15%, 21%]
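One way such a distance could be measured is sketched below: for each memory instruction, take its position in the issue stream minus its position in program order. The sign convention (negative means issued early, positive means issued late) and the function names are assumptions made for illustration; the slides do not define the metric precisely.

```python
def issue_distances(issue_order):
    """Map each instruction's sequence number to (issue position - program position).

    issue_order: sequence numbers in the order the instructions were issued.
    ASSUMPTION: 0 = issued in program order, negative = early, positive = late.
    """
    program_rank = {seq: i for i, seq in enumerate(sorted(issue_order))}
    return {seq: pos - program_rank[seq] for pos, seq in enumerate(issue_order)}


def in_order_fraction(issue_order):
    """Fraction of instructions whose distance from program order is zero."""
    distances = issue_distances(issue_order)
    return sum(1 for d in distances.values() if d == 0) / len(distances)


# Hypothetical issue stream: instruction 3 issues ahead of instruction 2.
distances = issue_distances([1, 3, 2, 4])
fraction = in_order_fraction([1, 3, 2, 4])
```

Binning these distances per benchmark would give a histogram of the same general shape as the figure, though the exact method used in the study may differ.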
Virtual Load/Store Queue (VLSQ)
• Traditional LSQ: Any ready instruction is issued
• Virtual LSQ: Only issue instructions residing in a virtual window
• Virtual window slides down only when instruction at virtual head is issued
[Figure: Traditional Load/Store Queue (entries MEM 0 through MEM N between LSQ HEAD and LSQ TAIL; Virtual Window Size = Inf) beside Virtual Load/Store Queue (same entries, with VIRTUAL HEAD and VIRTUAL TAIL bounding the window; Virtual Window Size = 4). Legend: ISSUED, READY, NOT READY]
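The issue rule described on this slide can be sketched in a few lines. This is a minimal model, not the authors' implementation: it assumes the virtual head is the oldest entry that has not yet issued, that the window covers `window_size` consecutive entries starting there, and that entries are simple dictionaries with hypothetical `ready` and `issued` flags.

```python
import math


def issuable_entries(entries, window_size):
    """Return indices of entries allowed to issue this cycle.

    entries: list of {"ready": bool, "issued": bool} in program order.
    window_size: positive int, or math.inf for a traditional LSQ.
    ASSUMPTION: the virtual head is the oldest unissued entry, so the
    window only slides once that entry has issued.
    """
    head = next((i for i, e in enumerate(entries) if not e["issued"]), None)
    if head is None:
        return []
    if math.isinf(window_size):
        tail = len(entries)
    else:
        tail = min(len(entries), head + window_size)
    return [i for i in range(head, tail)
            if entries[i]["ready"] and not entries[i]["issued"]]


def mem(ready, issued=False):
    """Build one hypothetical queue entry."""
    return {"ready": ready, "issued": issued}


# MEM 0 has issued; MEM 1 and MEM 4 are not ready; the rest are ready.
queue = [mem(True, issued=True), mem(False), mem(True), mem(True),
         mem(False), mem(True), mem(True)]

traditional = issuable_entries(queue, math.inf)  # any ready entry may issue
windowed = issuable_entries(queue, 4)            # only entries 1..4 are visible
strict = issuable_entries(queue, 1)              # strictly in-order issue
```

With a window of 4 the ready entries beyond the virtual tail must wait, and with a window of 1 nothing issues until the not-ready head does, which is where the memory ILP cost of small windows comes from.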
VLSQs: Replay Trap Stats
• ↑ OoO Aggressiveness (ROB from 80 to 512 entries)
– 5X increase in trap frequency
• VLSQs reduce trap frequency by factors of 2-30
– 25-60% of total execution time spent in traps
• VLSQs reduce total time handling traps by 10-40%
Direct correlation between memory ordering and replay traps
[Figure: Replay Traps / 1000 Instructions and Replay Trap Penalty, for ROB-80, ROB-128, ROB-256, ROB-512; VLSQ sizes Inf, 64, 32, 16, 8, 4, 1]
VLSQs: DL1 Cache Stats
• ↑ OoO Aggressiveness (ROB from 80 to 512 entries)
– 55% increase in L1 cache accesses
• VLSQs reduce cache accesses by up to 55%
– 15% increase in L1 cache misses
• VLSQs reduce cache misses by up to 10%
Direct correlation between memory ordering and cache accesses
[Figure: Normalized Accesses and Normalized Misses, for ROB-80, ROB-128, ROB-256, ROB-512; VLSQ sizes Inf, 64, 32, 16, 8, 4, 1]
VLSQ Performance
• Applications show three different behaviors
– Group I: Performance same – non-memory intensive apps
– Group II: Performance loss – memory intensive apps
– Group III: Performance benefit – alleviating negative effects
• VLSQ of size 16 or 32 is ideal across all apps
[Figure: CPI at ROB-512 for GROUP I, GROUP II, and GROUP III across VLSQ sizes Inf, 64, 32, 16, 8, 4, 1; CPI components: MEMORY, ALU, OTHER]
Power Savings with VLSQs
• Reducing Replay Traps
– 5-60% power savings in fetch/map/exec hardware
• Reducing Cache Accesses and Misses
– 5-65% savings in L1 data cache
• Savings of 25-30% using VLSQs of 16 or 32
[Figure: Execution Units (Normalized to Inf) and L1 Cache (Normalized to Inf), for ROB 80, ROB 128, ROB 256, ROB 512; series VLSQ 64, VLSQ 32, VLSQ 16, VLSQ 8, VLSQ 4, VLSQ 1]
Windowing of Load/Store Queue
• Static Mechanism (This Study):
– Statically set the size of the virtual window
– Drawback: Memory ILP lost during execution phases where negative effects do not exist
• Dynamic Mechanism (Future Work):
– Intuition that negative effects do not always exist
– Dynamically vary virtual window size based on application execution behavior
• Virtual window initially infinite
• Vary window size based on certain thresholds
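The dynamic mechanism is left as future work, so the following is only a guess at what a threshold-based controller might look like. The choice of replay traps per 1000 instructions as the signal, the `high` and `low` threshold values, and the one-step-at-a-time policy are all assumptions made for illustration; only the list of window sizes comes from the slides.

```python
import math

# Window sizes studied in the slides, smallest to largest.
SIZES = [1, 4, 8, 16, 32, 64, math.inf]


def next_window(current, traps_per_kinst, high=5.0, low=1.0):
    """Pick the window size for the next interval.

    ASSUMPTION: shrink one step when the trap rate exceeds `high`,
    grow one step when it falls below `low`, otherwise hold.
    The threshold values are placeholders, not measured numbers.
    """
    i = SIZES.index(current)
    if traps_per_kinst > high and i > 0:
        return SIZES[i - 1]
    if traps_per_kinst < low and i < len(SIZES) - 1:
        return SIZES[i + 1]
    return current


# Start with an infinite window, as the slide suggests, then react.
window = math.inf
window = next_window(window, traps_per_kinst=10.0)  # trap-heavy interval
after_heavy = window
window = next_window(window, traps_per_kinst=0.5)   # quiet interval
after_quiet = window
```

A real design would also need to decide how long an interval is and whether cache misses should feed into the decision, neither of which the slides specify.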
Summary
• In This Paper:
– Problem: Increase in replay traps and cache misses
– Reason: Reordering of memory instructions
– Solution: Virtual Load/Store Queues (VLSQs)
• Points To Take Home:
– Mechanism to improve performance causes degradation in the memory subsystem
– OoO cores shouldn’t always be on full throttle
– Because… at times we’ll NEED to tug on the reins
BACKUP SLIDES
THANK YOU!!!!
Agenda
• Motivation: Why is this study important?
• Paper Contributions
– The Problem
– The Reason
• Background
• Virtual Load Store Queues (VLSQs)
• A Limit Study Using VLSQs
• Summary
Background – Replay Traps
• Replay traps are hardware enforced to
– Force accesses to a particular memory location in order
• Ensure CORRECT execution
• Ensure multi-processor memory consistency
– Handle different sized accesses to same address
• Replay traps are NOT related to OS trap events, i.e. no handler support is needed
• Recovering from a replay trap
– Similar to handling branch mispredicts
– Pipeline is flushed and execution restarts from the replay trap causing instruction
OoO Hardware – Background
• Reorder Buffer (ROB), Issue Queues (Integer or Floating Point), and Load/Store Queues
[Figure: block diagram with blocks labeled FETCH, RENAME UNIT, SCH, BP, LP, IC, RN, ROB, IQ, FQ, LQ, SQ, and HD / TL markers on the queues]
BP = Branch Predictor, LP = Line Predictor, IC = Instruction Cache, RN = Register Rename
The Problem – ↑ L1 Cache Misses
• Increasing ROB size from 80 to 512
– 5–40% increase in L1 cache misses when compared to ROB-80
[Figure: series ROB 128, ROB 256, ROB 512]
The Problem – ↑ Replay Traps
• Increasing ROB size from 80 to 512
– 10–60% increase in replay trap overhead
[Figure: series ROB 80, ROB 128, ROB 256, ROB 512]
VLSQ Performance
[Figure: three panels, each covering ROB-80, ROB-128, ROB-256, ROB-512]
Replay Trap Distribution
LEGEND: Load-Store, Wrong-Size, Load-Load, Load-Miss Load