ICS’02

UPC

An Interleaved Cache Clustered VLIW Processor

E. Gibert, J. Sánchez* and A. González*

Dept. d’Arquitectura de Computadors

Universitat Politècnica de Catalunya (UPC)

* Also at Intel Barcelona Research Center

June 2002

Motivation

Capacity-bound vs. communication-bound

Solution: clustered microarchitectures

• Partition some hardware resources
• Simpler + faster
• Power consumption
• Communications not homogeneous

Goal: clustering the memory hierarchy in statically scheduled processors

Talk Outline

• State-of-the-art: MultiVLIW
• Interleaved Cache Clustered VLIW
• Scheduling Algorithms
• Enhancement: Attraction Buffers
• Experimental Framework
• Results
• Conclusions

State-of-the-art: MultiVLIW

Sánchez and González [MICRO’00]

[Figure: MultiVLIW block diagram. Clusters 1, 2, ..., n each contain a register file, functional units (F.U.) and an L1 data cache. Other labels: coherency network, register-to-register buses, next memory level.]

Basic Interleaved Cache Clustered VLIW Processor

[Figure: basic interleaved cache clustered VLIW processor. CLUSTER 1 to CLUSTER 4 each contain a register file, FUs and a cache module. A cache block from the next memory level (TAG, W0 to W7) is split into subblocks, one per cache module: TAG W0 W4 in cluster 1, TAG W1 W5 in cluster 2, TAG W2 W6 in cluster 3, TAG W3 W7 in cluster 4. Other labels: memory buses, register-to-register buses.]

Modulo Scheduling

Extract ILP from loops: overlap the execution of iterations

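[Figure: modulo schedule of LOOP L. Operations A, B, C of one iteration overlap with A', B', C' and A'', B'', C'' of the following iterations. Labels: II (initiation interval), SC (stage count), Kernel.]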
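As a rough illustration of the overlap described above, the following Python sketch lays out a modulo schedule for a three-operation loop body. The one-cycle operations, the issue offsets and the helper names (`stage_count`, `modulo_schedule`) are assumptions made for this example and do not come from the talk.

```python
import math

def stage_count(schedule_length, ii):
    """Number of stages (SC): how many iterations are in flight at once."""
    return math.ceil(schedule_length / ii)

def modulo_schedule(offsets, ii, iterations):
    """Return {cycle: [op labels]} when a new iteration starts every `ii` cycles.

    `offsets` maps each operation to its issue cycle within one iteration.
    """
    timeline = {}
    for it in range(iterations):
        for op, offset in offsets.items():
            cycle = it * ii + offset
            # A prime mark per iteration: A, A', A'' ...
            timeline.setdefault(cycle, []).append(op + "'" * it)
    return timeline

# Hypothetical loop body: A, B and C issue in consecutive cycles.
offsets = {"A": 0, "B": 1, "C": 2}
timeline = modulo_schedule(offsets, ii=1, iterations=3)
for cycle in sorted(timeline):
    print(cycle, " ".join(timeline[cycle]))
```

With II = 1, cycle 2 issues C, B' and A'' together. That steady-state pattern is the kernel, and `stage_count(3, 1)` gives the three stages it spans.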

Base Scheduling Algorithm

Used for Unified Cache

[Flowchart: START, Sort nodes, Next node, Select possible clusters, HowMany? (0: II=II+1; >0: Best profit in output edges), HowMany? (1: Schedule it; >1: Least loaded, then Schedule it).]
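One possible reading of the cluster-selection step in this flowchart is sketched below in Python. The function name `choose_cluster`, its arguments and the numeric profit values are hypothetical; the talk does not define how the profit in output edges is computed, so it is treated here as a given number per cluster.

```python
def choose_cluster(possible, profit, load):
    """Pick a cluster for one node, or return None if no cluster can take it.

    possible: cluster ids where the node fits at the current II
    profit:   cluster id -> profit in output edges (higher is better)
    load:     cluster id -> number of operations already placed there
    """
    if not possible:
        return None  # the caller reacts with II = II + 1 and starts over
    best = max(profit[c] for c in possible)
    candidates = [c for c in possible if profit[c] == best]
    if len(candidates) == 1:
        return candidates[0]
    # More than one cluster ties on profit: take the least loaded one.
    return min(candidates, key=lambda c: load[c])

# Illustrative values only.
profit = {0: 2, 1: 2, 2: 1}
load = {0: 5, 1: 3, 2: 0}
print(choose_cluster([0, 1, 2], profit, load))
```

Clusters 0 and 1 tie on profit here, so the least loaded of the two (cluster 1) is chosen. An empty candidate list returns `None`, which corresponds to the II=II+1 branch.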

Interleaved Cache Scheduling Algorithm

Unroll the loop to maximize the number of instructions whose stride is a multiple of NxI: each such instruction accesses ONE cache module

Assign latencies to memory instructions

Assign memory instructions to clusters:

– IPBC (Interleaved Pre-Build Chains): minimize stall time

– IBC (Interleaved Build Chains): minimize compute time
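The unrolling rule can be checked with a little arithmetic. The Python sketch below assumes a word-interleaved mapping in which the module holding an address is (address / I) mod N, with clusters numbered from 0. The helper names `cache_module` and `unroll_factor` are made up for this illustration.

```python
from math import gcd

def cache_module(address, n_clusters, interleave):
    """Cluster whose cache module holds `address` (word-interleaved mapping)."""
    return (address // interleave) % n_clusters

def unroll_factor(stride, n_clusters, interleave):
    """Smallest unroll factor that makes the unrolled stride a multiple of N x I."""
    span = n_clusters * interleave
    return span // gcd(stride, span)

N, I = 4, 4                      # 4 clusters, 4-byte interleaving
u = unroll_factor(4, N, I)       # a 4-byte stride needs unrolling by 4
# After unrolling, each of the u loads advances by 4 * u = 16 bytes per
# iteration, so every load keeps hitting the same cache module.
modules = [
    {cache_module(base + it * 4 * u, N, I) for it in range(8)}
    for base in (0, 4, 8, 12)
]
print(u, modules)
```

Each set in `modules` contains a single module number, which is the property the unrolling step is after: a load with a stride that is a multiple of NxI always maps to one cache module.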

Memory Dependent Instructions

[Figure: dependence graph of load, add and store instructions. The memory instructions are grouped into memory dependent chain 1 and memory dependent chain 2, and are annotated with Preferred=1 or Preferred=2.]

IPBC: preferred cluster info is used

vs.

IBC: minimize register communications
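A minimal sketch of how memory dependent chains could be formed is given below, assuming a chain is a connected group of memory instructions that is then placed in a single cluster. The union-find grouping and, in particular, the majority-vote policy in `chain_cluster` are assumptions for illustration; the talk does not spell out how IPBC combines the preferred-cluster information.

```python
from collections import Counter

def build_chains(n_instructions, dependences):
    """Group memory instructions into chains (connected components)."""
    parent = list(range(n_instructions))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in dependences:
        parent[find(a)] = find(b)

    chains = {}
    for i in range(n_instructions):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())

def chain_cluster(chain, preferred):
    """Assumed policy: the most common preferred cluster in the chain wins."""
    return Counter(preferred[i] for i in chain).most_common(1)[0][0]

# Five hypothetical memory instructions with two groups of dependences.
dependences = [(0, 1), (1, 2), (3, 4)]
preferred = [1, 1, 2, 2, 2]
chains = build_chains(5, dependences)
clusters = [chain_cluster(chain, preferred) for chain in chains]
print(chains, clusters)
```

In this example, instruction 2 prefers cluster 2 but is placed in cluster 1 with the rest of its chain, so some of its accesses become remote. That is the kind of scheduling restriction that chains introduce.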

Enhancement: Attraction Buffers

[Figure: local logic of a cluster. The ADDRESS is looked up in both the CACHE MODULE (TAG, W2, W6, with word select) and the ATTRACTION BUFFER (TAG, W). Each structure has a tag comparison (=) and produces data and hit signals. Labels: Local Data, ABuffer.]


An Example

N = 4 clusters, I = 4 bytes

Original loop:

for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
}

Unroll x4: 16-byte strides (a multiple of NxI)

for (i=0; i<MAX; i+=4) {
    ld r31, a[i]      (stride 16)
    ld r32, a[i+1]
    ld r33, a[i+2]
    ld r34, a[i+3]
    r41 = OP(r31)
    r42 = OP(r32)
    r43 = OP(r33)
    r44 = OP(r34)
    st r41, b[i]
    st r42, b[i+1]
    st r43, b[i+2]
    st r44, b[i+3]
}

[Figure: the elements a[0], a[1], a[2], a[3], ... are interleaved across CLUSTER 1 to CLUSTER 4. For the access ld r31, a[0], CLUSTER 4 is shown with a[3] and a[7] in its local module and a[0] and a[4] in its ABuffer.]
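The effect pictured above can be mimicked with a small Python model. It is only a sketch under stated assumptions: loads only, every access hits in some cache module, the attraction buffer has unbounded capacity, and a remote access copies the whole remote subblock into the buffer. Clusters are numbered from 0, so the slide's CLUSTER 4 is cluster 3 here, and the names `home_cluster`, `subblock` and `run` are hypothetical.

```python
N_CLUSTERS = 4      # N
INTERLEAVE = 4      # I, in bytes
LINE_BYTES = 32     # cache block size used in the experimental setup

def home_cluster(address):
    """Cluster whose cache module holds this address."""
    return (address // INTERLEAVE) % N_CLUSTERS

def subblock(address):
    """Identify a subblock as (cache block number, owning cluster)."""
    return (address // LINE_BYTES, home_cluster(address))

def run(cluster, addresses, use_abuffer):
    """Classify the accesses issued by `cluster` as local, abuffer or remote."""
    abuffer = set()  # assumption: unbounded capacity, no replacement
    counts = {"local": 0, "abuffer": 0, "remote": 0}
    for addr in addresses:
        if home_cluster(addr) == cluster:
            counts["local"] += 1
        elif use_abuffer and subblock(addr) in abuffer:
            counts["abuffer"] += 1
        else:
            counts["remote"] += 1
            if use_abuffer:
                abuffer.add(subblock(addr))  # attract the whole remote subblock
    return counts

# Cluster 3 reads a[0], a[4], a[8], a[12] with 4-byte elements.
addresses = [4 * i for i in (0, 4, 8, 12)]
print(run(3, addresses, use_abuffer=False))
print(run(3, addresses, use_abuffer=True))
```

Without the buffer all four accesses are remote. With it, the access to a[0] also brings in a[4], and the access to a[8] brings in a[12], so half of the accesses are served locally from the buffer.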

Why remote accesses? Why Attraction Buffers?

– Double precision accesses: low benefit
– Indirect accesses, a[b[i]]: low benefit
– “Unclear” preferred cluster: big benefit

    for (i=0; i<MAX; i++)
        for (k=i; k<i+MAX; k+=4)
            ld a[k], ld a[k+1], ld a[k+2], ld a[k+3]

– Memory dependent chains: big benefit
– IBC, where preferred cluster info is not used: big benefit
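To see why the preferred cluster is "unclear" in the nested loop above, the following Python sketch counts which cache module one of the four loads touches over the whole loop nest. It assumes 4-byte elements, an array `a` aligned to an NxI boundary and clusters numbered from 0. The function name `preferred_histogram` and its `lane` parameter (0 for `ld a[k]`, 1 for `ld a[k+1]`, and so on) are invented for this example.

```python
from collections import Counter

N_CLUSTERS = 4
INTERLEAVE = 4   # bytes
ELEM_BYTES = 4   # assumed element size, equal to the interleaving factor

def preferred_histogram(max_iter, lane):
    """Count the cache modules touched by the load of a[k + lane]."""
    hist = Counter()
    for i in range(max_iter):
        for k in range(i, i + max_iter, 4):
            address = (k + lane) * ELEM_BYTES
            hist[(address // INTERLEAVE) % N_CLUSTERS] += 1
    return hist

hist = preferred_histogram(8, lane=0)
print(sorted(hist.items()))
```

Inside one outer iteration the load always hits module i mod 4, but the starting index moves by one element per outer iteration. Over the whole nest the accesses spread evenly across the four modules, so no single cluster can be called preferred.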

Experimental Framework

IMPACT C compiler

Modulo scheduling on hyperblock loops

– BASE for a Unified Cache
– IPBC and IBC for an Interleaved Cache
– IPBC and IBC for the MultiVLIW
– The same unrolling factor has been used for all architecture configurations!

Mediabench benchmark suite

Experimental Framework

Number of clusters:                       4
Functional units:                         1 FP / cluster + 1 int / cluster + 1 mem / cluster
Cache configuration:                      8KB, 32-byte lines, 2-way set associative, 1 cycle latency
Reg-to-reg communication buses:           4 buses that run at ½ the core frequency
Memory buses:                             4 buses that run at ½ (or ¼) the core frequency
Next memory level:                        4 ports, 5 cycle latency, always hit
Interleaving factor (Interleaved Cache):  4 bytes
Latencies:                                1-10 (Unified Cache + MultiVLIW)
                                          1-(5/6)-10-15 (Interleaved Cache)

Results (I)

IPBC vs. IBC: similar cycle count results

MultiVLIW vs. Interleaved: similar results, BUT… lower complexity!

[Figure: Comparison with Unified Cache and MultiVLIW. Y axis: number of cycles (0 to 2). Benchmarks: epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc, rasta. Legend: stall time, compute time MULTIVLIW, compute time INTERLEAVED; bars labelled 1/2 and 1/4.]

Results (II)

Memory dependent chains

– Interleaved cache: workload imbalance + remote accesses
– MultiVLIW: workload imbalance
– Working on techniques to overcome scheduling restrictions

[Figure: Interleaved and MultiVLIW with and without chains. Y axis: number of cycles (0 to 1.5). Benchmarks: epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc, rasta. Legend: stall time, compute time MULTIVLIW, compute time INTERLEAVED; bars labelled no chains and chains.]

Results (III)

[Figure: Remote hit classification with no ABuffers and with ABuffers. Y axis: remote hits (0 to 140). Benchmarks: epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc, rasta. Legend: OTHER, CLSC, CLOC; bars labelled ABuffers and No ABuffers.]

Local hits are increased by 15%

Stall time is reduced by 30%

Conclusions

Scheduling Algorithms

– Good latency assignment process (stall time accounts for 9% of execution time)
– Coherence kept through memory dependent chains (5% cycle count degradation)

Attraction Buffers

– Effective to increase local hits (15% average) + reduce stall time (30% average)
– Reduce remote hits to previously accessed subblocks (70% average)

Cycle count results

– Similar to Unified Cache and MultiVLIW

Questions
