ICS’02
UPC
An Interleaved Cache Clustered VLIW Processor
E. Gibert, J. Sánchez* and A. González*
Dept. d’Arquitectura de Computadors
Universitat Politècnica de Catalunya (UPC)
* Also at Intel Barcelona Research Center
June 2002
Motivation
Capacity-bound vs. communication-bound
Solution: clustered microarchitectures
• Partition some hardware resources
• Simpler + faster
• Power consumption
• Communications not homogeneous
Goal: clustering the memory hierarchy in statically scheduled processors
Talk Outline
• State-of-the-art: MultiVLIW
• Interleaved Cache Clustered VLIW
• Scheduling Algorithms
• Enhancement: Attraction Buffers
• Experimental Framework
• Results
• Conclusions
State-of-the-art: MultiVLIW
Sánchez and González [MICRO’00]
[Figure: clusters 1 to n, each with a register file, functional units (F.U.) and an L1 data cache. Labeled interconnects: register-to-register buses, coherency network, next memory level.]
Basic Interleaved Cache Clustered VLIW Processor
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, FUs and a cache module. A cache block of the NEXT MEMORY LEVEL (TAG W0 W1 W2 W3 W4 W5 W6 W7) is word-interleaved across the cache modules as subblocks: TAG W0 W4 in cluster 1, TAG W1 W5 in cluster 2, TAG W2 W6 in cluster 3, TAG W3 W7 in cluster 4. Memory buses link the cache modules to the next memory level; register-to-register buses link the clusters.]
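The word-interleaved mapping in the figure can be sketched in a few lines. This is a minimal illustration, not the hardware logic from the talk: the function names are hypothetical, and the defaults (4 clusters, 4-byte interleaving factor, 8 words per 32-byte block) are taken from the Experimental Framework slide.

```python
def home_cluster(addr: int, n_clusters: int = 4, interleave: int = 4) -> int:
    """Return the 0-based cluster whose cache module holds byte address addr.

    Assumption: consecutive interleave-sized words go to consecutive clusters.
    """
    return (addr // interleave) % n_clusters


def subblock_words(cluster: int, words_per_block: int = 8, n_clusters: int = 4) -> list[int]:
    """Word indices of one cache block that live in the given cluster's module."""
    return list(range(cluster, words_per_block, n_clusters))


# Rebuild the figure's layout, using 1-based cluster numbers as on the slide.
layout = {c + 1: [f"W{w}" for w in subblock_words(c)] for c in range(4)}
print(layout)
```

With these defaults, cluster 3 ends up holding W2 and W6, which matches the subblock drawn in the figure.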
Modulo Scheduling
Extract ILP from loops: overlap execution of iterations
[Figure: LOOP L with operations A, B, C. Successive iterations (A B C, A’ B’ C’, A’’ B’’ C’’) are overlapped; labels: II, SC, Kernel.]
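The overlap shown in the figure can be reproduced with a toy model. This is only a sketch under simplifying assumptions: every operation takes one cycle, resource and dependence checks are ignored, and the offsets for A, B and C are hypothetical values chosen to mirror the picture.

```python
from math import ceil


def modulo_timeline(offsets: dict[str, int], ii: int, iterations: int) -> dict[int, list[str]]:
    """Map each cycle to the ops issued in it when iteration k starts at cycle k * ii."""
    timeline: dict[int, list[str]] = {}
    for k in range(iterations):
        for op, off in offsets.items():
            timeline.setdefault(k * ii + off, []).append(f"{op}{k}")
    return timeline


def stage_count(offsets: dict[str, int], ii: int) -> int:
    """Number of II-long stages that one iteration spans (SC in the figure)."""
    return ceil((max(offsets.values()) + 1) / ii)


offsets = {"A": 0, "B": 1, "C": 2}  # hypothetical issue cycles within one iteration
timeline = modulo_timeline(offsets, ii=1, iterations=4)
for cycle in sorted(timeline):
    print(cycle, timeline[cycle])
print("SC =", stage_count(offsets, ii=1))
```

From cycle 2 onwards every cycle issues one C, one B and one A from three different iterations; that repeating pattern is the kernel.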
Base Scheduling Algorithm
Used for Unified Cache
[Flowchart boxes: START, Sort nodes, Next node, Select possible clusters, HowMany?, Best profit in output edges, HowMany?, Least loaded, Schedule it, II=II+1. Branch labels: 0, 1, >0, >1.]
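One plausible reading of the flowchart's cluster choice is sketched below. The order of the tie-breaks (no candidate means retry with a larger II, one candidate is taken directly, several candidates are ranked by output-edge profit and then by load) is an assumption inferred from the box labels, and `choose_cluster`, `profit` and `load` are hypothetical names, not the authors' code.

```python
from typing import Optional


def choose_cluster(candidates: list[int],
                   profit: dict[int, int],
                   load: dict[int, int]) -> Optional[int]:
    """Pick a cluster for the next node, or None when II must be increased.

    candidates: clusters where the node fits at the current II.
    profit:     assumed per-cluster benefit in output edges (higher is better).
    load:       assumed per-cluster occupancy (lower is better).
    """
    if not candidates:
        return None                      # "0" branch: II = II + 1 and start over
    if len(candidates) == 1:
        return candidates[0]             # "1" branch: schedule it
    best = max(profit[c] for c in candidates)
    tied = [c for c in candidates if profit[c] == best]
    if len(tied) == 1:
        return tied[0]                   # best profit is unique
    return min(tied, key=lambda c: load[c])  # still tied: least loaded


print(choose_cluster([0, 1, 2], {0: 1, 1: 3, 2: 3}, {0: 0, 1: 4, 2: 2}))
```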
Interleaved Cache Scheduling Algorithm
• Unroll the loop to maximize the number of instructions with a stride multiple of NxI (N = number of clusters, I = interleaving factor); such instructions always access ONE cache module
• Assign latencies to memory instructions
• Assign memory instructions to clusters:
– IPBC (Interleaved Pre-Build Chains): minimize stall time
– IBC (Interleaved Build Chains): minimize compute time
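The unrolling step can be illustrated with a little arithmetic. The slide only states the goal (strides that are multiples of NxI); the formula below, which picks the smallest unroll factor that achieves it for a single memory instruction, is an assumption for illustration, and `unroll_factor` is a hypothetical helper.

```python
from math import gcd


def unroll_factor(stride_bytes: int, n_clusters: int = 4, interleave: int = 4) -> int:
    """Smallest factor u such that u * stride_bytes is a multiple of N x I."""
    span = n_clusters * interleave
    return span // gcd(span, abs(stride_bytes))


for stride in (4, 8, 12, 16):
    u = unroll_factor(stride)
    print(f"stride {stride:2d} bytes -> unroll x{u}, new stride {u * stride} bytes")
```

For a 4-byte stride with N = 4 and I = 4 bytes this gives an unroll factor of 4 and a 16-byte stride, the same numbers used in the example slide of this talk.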
Memory Dependent Instructions
[Figure: dependence graph of store, load and add instructions containing two memory dependent chains (chain 1 and chain 2). Memory instructions are annotated with a preferred cluster (Preferred=1 or Preferred=2).]
IPBC: preferred cluster info is used
vs.
IBC: minimize register communications
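Grouping memory instructions into chains can be sketched as a connected-components problem. Everything here is illustrative: the instruction names, the dependence pairs and the majority vote used to pick a chain's cluster are assumptions, since the slide only says that IPBC uses the preferred cluster information.

```python
from collections import Counter


def build_chains(mem_ops: list[str], deps: list[tuple[str, str]]) -> list[list[str]]:
    """Group memory ops connected by memory dependences (union-find)."""
    parent = {op: op for op in mem_ops}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in deps:
        parent[find(a)] = find(b)
    chains: dict[str, list[str]] = {}
    for op in mem_ops:
        chains.setdefault(find(op), []).append(op)
    return sorted(chains.values())


def cluster_for_chain(chain: list[str], preferred: dict[str, int]) -> int:
    """Assumed policy: the whole chain goes to its most common preferred cluster."""
    return Counter(preferred[op] for op in chain).most_common(1)[0][0]


mem_ops = ["st1", "ld1", "ld2", "st2", "ld3"]          # hypothetical instructions
deps = [("st1", "ld1"), ("ld2", "st2")]                # hypothetical dependences
preferred = {"st1": 1, "ld1": 1, "ld2": 2, "st2": 2, "ld3": 1}
chains = build_chains(mem_ops, deps)
print([(chain, cluster_for_chain(chain, preferred)) for chain in chains])
```

Keeping a whole chain in one cluster means dependent accesses reach the same cache module in program order, at the cost of less freedom for the scheduler.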
Enhancement: Attraction Buffers
[Figure: a CACHE MODULE (entry TAG W2 W6) and an ATTRACTION BUFFER (entry TAG W) inside a cluster. Labels: ADDRESS, tag compare (=), word select, local logic, data, hit, Local Data, ABuffer.]
An Example
N = 4 clusters, I = 4 bytes
Original loop:
for (i=0; i<MAX; i++) {
  ld r3, a[i]
  r4 = OP(r3)
  st r4, b[i]
}
Unroll x4:
for (i=0; i<MAX; i+=4) {
  ld r31, a[i]      (stride 16)
  ld r32, a[i+1]
  ld r33, a[i+2]
  ld r34, a[i+3]
  r41 = OP(r31)
  r42 = OP(r32)
  r43 = OP(r33)
  r44 = OP(r34)
  st r41, b[i]
  st r42, b[i+1]
  st r43, b[i+2]
  st r44, b[i+3]
}
16-byte strides (a multiple of NxI)
[Figure: a[0] a[1] a[2] a[3] ... interleaved across CLUSTER 1 to CLUSTER 4. ld r31, a[0] is shown at CLUSTER 4, whose local module holds a[3] a[7] and whose ABuffer holds a[0] a[4].]
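The effect shown in the figure can be imitated with a small counting model. It is a sketch under stated assumptions: array a is block-aligned with 4-byte elements, a remote access copies the whole remote subblock into the Attraction Buffer, and the buffer never evicts. `classify_accesses` is a hypothetical helper, not the simulator used in the talk.

```python
def classify_accesses(cluster: int, word_indices: list[int], n_clusters: int = 4,
                      words_per_block: int = 8, use_abuffer: bool = True) -> dict[str, int]:
    """Count local, Attraction Buffer and remote accesses seen by one cluster (0-based)."""
    abuffer: set[tuple[int, int]] = set()   # (block number, home cluster) of copied subblocks
    counts = {"local": 0, "abuffer": 0, "remote": 0}
    for w in word_indices:
        home = w % n_clusters
        subblock = (w // words_per_block, home)
        if home == cluster:
            counts["local"] += 1
        elif use_abuffer and subblock in abuffer:
            counts["abuffer"] += 1
        else:
            counts["remote"] += 1
            if use_abuffer:
                abuffer.add(subblock)       # assumed: whole remote subblock is attracted
    return counts


# ld r31 placed in cluster 4 (index 3) touches a[0], a[4], a[8], a[12].
accesses = [0, 4, 8, 12]
print(classify_accesses(3, accesses, use_abuffer=False))
print(classify_accesses(3, accesses, use_abuffer=True))
```

In this toy run, the first access to each block is remote and brings in both a[0] and a[4] (or a[8] and a[12]), so every second access is served from the Attraction Buffer.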
Enhancement: Attraction Buffers
Why remote accesses? Why Attraction Buffers?
– Double precision accesses: low benefit
– Indirect accesses, a[b[i]]: low benefit
– “Unclear” preferred cluster: big benefit
  for (i=0; i<MAX; i++)
    for (k=i; k<i+MAX; k+=4)
      ld a[k], ld a[k+1], ld a[k+2], ld a[k+3]
– Memory dependent chains: big benefit
– IBC (preferred cluster info is not used): big benefit
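Why the preferred cluster is "unclear" in the nested loop can be checked by counting which module each `ld a[k]` hits. The sketch assumes a block-aligned array of 4-byte elements and 4 clusters; `cluster_histogram` is a hypothetical helper and the loop bound 8 is an arbitrary small value.

```python
from collections import Counter


def cluster_histogram(max_n: int, slot: int = 0, n_clusters: int = 4) -> Counter:
    """Count, per home cluster, the accesses made by `ld a[k + slot]` in the loop nest."""
    hist: Counter = Counter()
    for i in range(max_n):
        for k in range(i, i + max_n, 4):
            hist[(k + slot) % n_clusters] += 1
    return hist


hist = cluster_histogram(8)
print(sorted(hist.items()))
```

Inside one run of the inner loop the load always hits the same module (k advances by 4 words), but that module changes with i, so over the whole nest the accesses spread evenly and no static cluster assignment is local all the time.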
Experimental Framework
• IMPACT C compiler
• Modulo scheduling on hyperblock loops
– BASE for a Unified Cache
– IPBC and IBC for an Interleaved Cache
– IPBC and IBC for the MultiVLIW
– The same unrolling factor has been used for all architecture configurations!
• Mediabench benchmark suite
Experimental Framework
Number of clusters: 4
Functional units: 1 FP / cluster + 1 int / cluster + 1 mem / cluster
Cache configuration: 8KB, 32-byte lines, 2-way set associative, 1 cycle latency
Reg-to-reg communication buses: 4 buses that run at ½ the core frequency
Memory buses: 4 buses that run at ½ (or ¼) the core frequency
Next memory level: 4 ports, 5 cycle latency, always hit
Interleaving factor (Interleaved Cache): 4 bytes
Latencies: 1-10 (Unified Cache + MultiVLIW); 1-(5/6)-10-15 (Interleaved Cache)
Results (I)
• IPBC vs IBC: similar cycle count results
• MultiVLIW vs Interleaved: similar results BUT… lower complexity!
[Figure: Comparison with Unified Cache and MultiVLIW. Y axis: number of cycles (0 to 2), split into stall time and compute time. Series: MULTIVLIW, INTERLEAVED (1/2 and 1/4). Benchmarks: epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc, rasta.]
Results (II)
Memory dependent chains
– Interleaved cache: workload imbalance + remote accesses
– MultiVLIW: workload imbalance
– Working on techniques to overcome scheduling restrictions
[Figure: Interleaved and MultiVLIW with and without chains. Y axis: number of cycles (0 to 1.5), split into stall time and compute time. Series: MULTIVLIW, INTERLEAVED (no chains and chains). Benchmarks: epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc, rasta.]
Results (III)
[Figure: Remote hit classification with no ABuffers and with ABuffers. Y axis: remote hits (0 to 140). Categories: OTHER, CLSC, CLOC. Benchmarks: epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc, rasta.]
• Local hits are increased by 15%
• Stall time reduced by 30%
Conclusions
• Scheduling Algorithms
– Good latency assignment process (stall time accounts for 9% of execution time)
– Coherence kept through memory dependent chains (5% cycle count degradation)
• Attraction Buffers
– Effective to increase local hits (15% average) + reduce stall time (30% average)
– Reduce remote hits to previously accessed subblocks (70% average)
• Cycle count results
– Similar to Unified Cache and MultiVLIW
Questions