Reducing memory penalty by a programmable prefetch engine for on-chip caches
Transcript of Reducing memory penalty by a programmable prefetch engine for on-chip caches
Presentation for the Computer Architecture course by
Armin van der Togt
Outline:
• Problem statement
• The prefetch architecture
• Results
• Conclusions
• Related work
Problem statement
• The gap between memory speed and CPU speed keeps growing, hence: caches and prefetching
• Hardware prefetching is expensive, and complex memory access patterns are hard to handle
• Software prefetching incurs a lot of execution overhead
Software prefetching

Original code:

  for j = 0 to 100
    for i = 0 to 100
      A[j][i] = B[i][0] + B[i+1][0]
    end

Generated code (inner loop only):

  prefetch (&A[j][0])
  for i = 0 to 5 by 2
    prefetch (&B[i+1][0])
    prefetch (&B[i+2][0])
    prefetch (&A[j][i+1])
  end
  for i = 0 to 93 by 2
    prefetch (&B[i+7][0])
    prefetch (&B[i+8][0])
    prefetch (&A[j][i+7])
    A[j][i] = B[i][0] + B[i+1][0]
    A[j][i+1] = B[i+1][0] + B[i+2][0]
  end
  for i = 94 to 100 by 2
    A[j][i] = B[i][0] + B[i+1][0]
    A[j][i+1] = B[i+1][0] + B[i+2][0]
  end
The prefetch architecture
[Figure: block diagram of the processor chip: the PC and ALU feed the prefetch engine, whose Run-Ahead Table holds entries (I4, I5) with the fields address, base, stride, count, and start; the engine issues prefetches through the on-chip cache to the memory system]
iaddr: PC value at which a prefetch is triggered
<base, stride>: prefetch address and step size
<count, start>: prefetch conditions
count: a prefetch is started once every count times that PC = iaddr
start: prefetching may only begin after the above condition has been met start times
New instruction to program the prefetch engine:
fill_run_ahead iaddr, <base, stride>, <count, start>
Example
  int W[100], B[100][100];
  for (i = 0; i < 200; i++)
    for (k = 0; k < i; k++)
      W[i] += B[k][i] + W[i-k-1];
  $32:                     # 6  W[i] += B[k][i] + W[i-k-1];
    lw   $25, 0($14)       # W[i]
    lw   $26, 0($16)       # B[k][i]
    lw   $24, 0($15)       # W[i-k-1]
    addu $10, $25, $26     # sum up
    addu $10, $24, $10     # sum up
    sw   $10, 0($14)       # store W[i]
    addu $15, $15, -4
    addu $16, $16, 100
    add  $4, $4, 1
    blt  $4, 100, $32      # branch

memory latency = 5 cycles
Code with prefetch instructions
    addu $3, $16, 400
    fill_run_ahead I4, $3, 400, 1, 0   # prefetch for B[k][i]
    addu $3, $15, 16
    fill_run_ahead I5, $3, -4, 4, 1    # prefetch for W[i-k-1]
  $32:
  I1: lw   $25, 0($14)     # W[i]
  I2: lw   $26, 0($16)     # B[k][i]
  I3: lw   $24, 0($15)     # W[i-k-1]
      addu $10, $25, $26   # sum up
      addu $10, $24, $10   # sum up
      sw   $10, 0($14)     # store W[i]
  I4: addu $4, $4, 1
  I5: addu $16, $16, 100
      addu $15, $15, -4
      blt  $4, 100, $32    # branch
Results
Conclusions
• Prefetching can reduce the memory penalty by up to 80%
• A programmable prefetch engine lowers the penalty compared to software prefetching
• For small caches (1-2 KB) the programmable prefetch engine is relatively expensive
• The compiler must support prefetching
Related work
• Fu and Patel: stride-directed prefetching in scalar processors (hardware)
• Mowry and Gupta: software-controlled prefetching
• Chiueh: a programmable hardware prefetch architecture for numerical loops (similar to this work)
References
• Tien-Fu Chen, "Reducing memory penalty by a programmable prefetch engine for on-chip caches", Microprocessors and Microsystems 21 (1997) 121-130