Lecture 11: Memory Data Flow Techniques
Transcript of Lecture 11: Memory Data Flow Techniques
Load/store buffer design, memory-level parallelism, memory consistency models, memory disambiguation
Load/Store Execution Steps
Load: LW R2, 0(R1)
1. Generate virtual address; may wait on the base register
2. Translate the virtual address into a physical address
3. Read the data cache
Store: SW R2, 0(R1)
1. Generate virtual address; may wait on the base register and the data register
2. Translate the virtual address into a physical address
3. Write the data cache
Unlike register accesses, memory addresses are not known prior to execution.
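The three steps can be sketched in Python; the page table, cache, and register names here are illustrative stand-ins, not any real ISA's translation scheme:

```python
# Sketch of the three load/store execution steps with a hypothetical
# one-level page table and a flat data cache.
PAGE_SIZE = 4096
page_table = {0: 7}            # virtual page 0 -> physical frame 7
cache = {}                     # physical address -> data
regs = {"R1": 0x10, "R2": 0}   # base register R1, data register R2

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset

def execute_load(base_reg, offset, dest_reg):
    vaddr = regs[base_reg] + offset       # 1. generate virtual address
    paddr = translate(vaddr)              # 2. translate to physical address
    regs[dest_reg] = cache.get(paddr, 0)  # 3. read data cache

def execute_store(base_reg, offset, src_reg):
    vaddr = regs[base_reg] + offset       # 1. generate virtual address
    paddr = translate(vaddr)              # 2. translate to physical address
    cache[paddr] = regs[src_reg]          # 3. write data cache

regs["R2"] = 42
execute_store("R1", 0, "R2")   # SW R2, 0(R1)
regs["R2"] = 0
execute_load("R1", 0, "R2")    # LW R2, 0(R1)
```

Note that step 1 is where a load/store differs from a register access: the effective address only exists after the base register is read.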
Load/Store Buffer in Tomasulo
Supports memory-level parallelism:
Loads wait in the load buffer until their address is ready; memory reads are then processed
Stores wait in the store buffer until their address and data are ready; memory writes wait further until the stores are committed
[Figure: Tomasulo-style pipeline. Fetch Unit and IM feed Decode/Rename; instructions dispatch to reservation stations (RS) for FU1/FU2 and to the load buffer (L-buf) and store buffer (S-buf), which access data memory (DM); the Reorder Buffer and Regfile hold results.]
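The two wait conditions above can be written out as predicates; this is a toy model, with illustrative names rather than real hardware structures:

```python
# Toy model of the Tomasulo load/store buffer wait conditions.
class StoreBufEntry:
    def __init__(self):
        self.addr = None       # filled when the address is computed
        self.data = None       # filled when the data register is ready
        self.committed = False # set when the ROB commits the store

def load_can_read(addr_ready):
    # A load waits in the load buffer only for its address.
    return addr_ready

def store_can_write(e):
    # A store writes memory only when address AND data are ready
    # AND the store has been committed by the reorder buffer.
    return e.addr is not None and e.data is not None and e.committed

e = StoreBufEntry()
e.addr, e.data = 0x100, 42
ok_before = store_can_write(e)  # False: not yet committed
e.committed = True
ok_after = store_can_write(e)   # True
```

The asymmetry is the point: loads may read memory as soon as their address resolves, while stores must additionally wait for commit so that a mis-speculated store never reaches memory.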
Load/Store Unit with Centralized RS
The centralized RS takes over part of the role of Tomasulo's load/store buffers
Loads and stores wait in the RS until they are ready
[Figure: Fetch Unit and IM feed Decode/Rename; a centralized RS issues to FU1/FU2 and to the store unit (S-unit) and load unit (L-unit); the store unit deposits addr/data into a store buffer, both units access the cache; the Reorder Buffer and Regfile hold results.]
Memory-level Parallelism
for (i = 0; i < 100; i++)
    A[i] = A[i]*2;

Loop: L.S  F2, 0(R1)
      MULT F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop
(F4 stores 2.0)

[Figure: timeline of overlapped memory accesses LW1, SW1, LW2, SW2, LW3, SW3 across loop iterations.]

Significant improvement over sequential reads/writes
Memory Consistency
Memory contents must be the same as those produced by sequential execution
Must respect RAW, WAW, and WAR dependences
Practical implementations:
1. Reads may proceed out-of-order
2. Writes proceed to memory in program order
3. Reads may bypass earlier writes only if their addresses are different
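Rule 3 is a simple predicate; a minimal sketch (the function name and address representation are illustrative):

```python
# Rule 3: a read may bypass earlier, not-yet-performed writes only if its
# address differs from every one of their addresses.
def read_may_bypass(load_addr, pending_store_addrs):
    return all(load_addr != a for a in pending_store_addrs)

ok = read_may_bypass(0x200, [0x100, 0x180])       # different addresses
blocked = not read_may_bypass(0x100, [0x100, 0x180])  # same address: wait
```

Rules 1-3 together preserve RAW (rule 3), WAW and WAR (rule 2, in-order writes) while still allowing out-of-order reads.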
Store Stages in Dynamic Execution
1. Wait in RS until base address and store data are available (ready)
2. Move to store unit for address calculation and address translation
3. Move to store buffer (finished)
4. Wait for ROB commit (completed)
5. Write to data cache (retired)
Stores always retire in order, preserving WAW and WAR dependences.
Source: Shen and Lipasti, page 197
[Figure: stores flow from the RS through the store unit into the store buffer (finished), become completed after ROB commit, and are finally written to the D-cache; the load unit accesses the D-cache directly.]
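The five store stages above can be written as a tiny state machine; the stage names follow the slide, while the transition conditions are a sketch:

```python
# Store lifecycle from the list above: RS -> store unit -> store buffer
# (finished) -> ROB commit (completed) -> D-cache write (retired).
def next_stage(stage, *, operands_ready=False, addr_done=False,
               rob_committed=False, wrote_cache=False):
    if stage == "in_RS" and operands_ready:     # base address + data ready
        return "in_store_unit"                  # address calc/translation
    if stage == "in_store_unit" and addr_done:
        return "finished"                       # sits in the store buffer
    if stage == "finished" and rob_committed:
        return "completed"                      # past the ROB commit point
    if stage == "completed" and wrote_cache:
        return "retired"                        # data cache updated
    return stage                                # otherwise: keep waiting

s = "in_RS"
s = next_stage(s, operands_ready=True)
s = next_stage(s, addr_done=True)
s = next_stage(s, rob_committed=True)
s = next_stage(s, wrote_cache=True)   # s is now "retired"
```

Because each store only advances when its own condition holds and stores enter the buffer in program order, retirement is naturally in order.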
Load Bypassing and Memory Disambiguation
To exploit memory-level parallelism, loads have to bypass earlier stores; but this may violate RAW dependences
Dynamic memory disambiguation: detect memory dependences at run time by comparing a load's address with every older store's address
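The comparison can return three outcomes, since an older store's address may still be unknown; a minimal sketch (the function and outcome names are illustrative):

```python
# Dynamic memory disambiguation: compare a load's address against the
# addresses of ALL older, not-yet-retired stores.
def disambiguate(load_addr, older_stores):
    """older_stores: list of (addr, data), oldest first; addr is None if
    that store's address has not been computed yet."""
    for addr, _ in older_stores:
        if addr is None:
            return "unknown"      # cannot prove independence yet
        if addr == load_addr:
            return "raw"          # the load depends on this store
    return "independent"          # safe to bypass all older stores
```

In hardware this is an associative search over the store buffer, performed in parallel rather than as a loop.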
Load Bypassing Implementation
[Figure: load bypassing implementation. The RS issues stores to the store unit, which performs (1) address calculation and (2) address translation, then places addr/data in the store buffer; a load's address is associatively searched against the store-buffer addresses, and (3) if there is no match the load updates its destination register from the D-cache. Assumes in-order execution of loads/stores.]
Load Forwarding
Load forwarding: if a load's address matches an older store's address, the store's data can be forwarded to the load
If a match is found, forward the matching data to the destination register (in the ROB)
Multiple matches may exist; the last (youngest) one wins
[Figure: same datapath as load bypassing, with an added forwarding path from the matching store-buffer entry's data to the load's destination register; loads/stores still issue in order from the RS.]
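The "last one wins" rule above can be sketched directly; the store buffer is modeled as a program-ordered list (an illustrative structure, not the hardware CAM):

```python
# Load forwarding: scan the store buffer in program order and let the
# youngest (last) matching store supply the data.
def forward(load_addr, store_buffer):
    """store_buffer: list of (addr, data), oldest first."""
    data = None
    for addr, d in store_buffer:
        if addr == load_addr:
            data = d      # a later match overwrites an earlier one
    return data           # None -> no match: read the D-cache instead

buf = [(0x100, 1), (0x200, 2), (0x100, 3)]
hit = forward(0x100, buf)    # 3: the youngest store to 0x100 wins
miss = forward(0x300, buf)   # None: go to the D-cache
```

Picking the youngest match is what preserves the RAW dependence: the load must see the most recent store to its address, not just any store.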
In-order Issue Limitation
Any store in the RS may block all following loads
When is F2 of the SW available?
When is the next L.S ready?
Assume reasonable FU latency and pipeline length

for (i = 0; i < 100; i++)
    A[i] = A[i]/2;

Loop: L.S  F2, 0(R1)
      DIV  F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop
Speculative Load Execution
Forwarding does not always work when some older store addresses are still unknown
No match in the store buffer: predict that the load has no RAW dependence on older stores and execute it out of order
Finished loads keep their addresses in a finished-load buffer; each store's address is matched against that buffer at completion
If a match is found, the prediction was wrong: flush the pipeline at commit

[Figure: the load unit executes loads out of order and records their addresses in a finished-load buffer; completing stores associatively match their addresses against this buffer, in addition to the store buffer's forwarding path to the D-cache.]
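The detection side of this scheme can be sketched as follows; the buffer is a simple set here, and the names are illustrative:

```python
# Speculative load execution: loads run early on the prediction that no
# older store writes their address; the finished-load buffer is checked
# by every completing store to catch mispredictions.
finished_loads = set()

def execute_load_speculatively(addr):
    finished_loads.add(addr)       # remember the speculative load's address

def complete_store(addr):
    if addr in finished_loads:
        return "flush_at_commit"   # a younger load already read stale data
    return "ok"

execute_load_speculatively(0x100)
r1 = complete_store(0x200)   # "ok": no speculative load touched 0x200
r2 = complete_store(0x100)   # "flush_at_commit": misprediction detected
```

The check happens when the store's address becomes known (at completion), which is exactly the information that was missing when the load was allowed to run ahead.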
Alpha 21264 Pipeline
Alpha 21264 Load/Store Queues
[Figure: Alpha 21264 execution core. The int issue queue feeds two AddrALUs and two IntALUs, each pair backed by a duplicated 80-entry Int RF; the fp issue queue feeds two FPALUs backed by a 72-entry FP RF; memory operations go through the D-TLB, load queue (L-Q), store queue (S-Q), and a dual-ported D-Cache.]
32-entry load queue, 32-entry store queue
Load Bypassing, Forwarding, and RAW Detection
A load's address is matched against the store queue (SQ): if it matches, the matching store's data is forwarded
A store's address is matched against the load queue (LQ): if it matches an already-executed younger load, a store-load trap is marked to flush the pipeline (at commit)
At the ROB: a load WAITs if the LQ head is not completed, then the LQ head is moved; a store marks the SQ head as completed, then the SQ head is moved

[Figure: the issue queues (IQ) feed the LQ and SQ in front of the D-cache; matches trigger forwarding or a store-load trap at commit.]
Speculative Memory Disambiguation
A 1,024-entry, 1-bit stWait table is indexed by the fetch PC and consulted as the renamed instruction enters the int issue queue:
• When a load is trapped at commit, set the stWait bit in the table, indexed by the load's PC
• When the load is fetched, read its stWait bit from the table
• A load whose stWait bit is set waits in the issue queue until all older stores have issued
• The stWait table is cleared periodically
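The four steps above can be sketched as a tiny predictor; the index function is a hypothetical choice, not the 21264's actual hashing:

```python
# Sketch of the 1,024-entry, 1-bit stWait table described above.
TABLE_SIZE = 1024
st_wait = [0] * TABLE_SIZE

def index(pc):
    return (pc >> 2) % TABLE_SIZE    # hypothetical PC-to-entry mapping

def on_load_trap(pc):
    st_wait[index(pc)] = 1           # load was trapped at commit: remember it

def on_load_fetch(pc):
    # True -> hold this load in the issue queue until older stores issue.
    return st_wait[index(pc)] == 1

def periodic_clear():
    for i in range(TABLE_SIZE):      # forget stale predictions
        st_wait[i] = 0

before = on_load_fetch(0x40)   # False: never trapped
on_load_trap(0x40)
after = on_load_fetch(0x40)    # True: this load will now wait
```

The periodic clear trades a few repeated traps for not pessimistically stalling loads whose conflicting store is long gone.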
Architectural Memory States
A memory request searches the hierarchy from top to bottom:
L1-Cache
L2-Cache
L3-Cache (optional)
Memory
Disk, tape, etc.
Completed LQ/SQ entries hold not-yet-committed states; the hierarchy below holds the committed, architectural states
Summary of Superscalar Execution
Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
Register data flow techniques: register renaming, instruction scheduling, in-order commit, mis-prediction recovery
Memory data flow techniques: load/store units, memory consistency
Source: Shen & Lipasti