A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo...

18
A Data Cache with A Data Cache with Dynamic Mapping Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto

Transcript of A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo...

Page 1: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

A Data Cache with A Data Cache with Dynamic MappingDynamic MappingA Data Cache with A Data Cache with Dynamic MappingDynamic Mapping

P. D'Alberto, A. Nicolau and A. Veidenbaum

ICS-UCI

Speaker

Paolo D’Alberto

Page 2: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Problem IntroductionProblem IntroductionProblem IntroductionProblem Introduction

● Blocked algorithms have good performance on average– Because exploiting data temporal locality

● For some input sets data cache interference nullifies cache locality

Page 3: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Problem Introduction, cont.Problem Introduction, cont.Problem Introduction, cont.Problem Introduction, cont.

Blocked Matrix Multiplication: 16KB 1-way data cache

0

2

4

6

8

10

12

14

16

18

Matrix Size

mis

s r

ate

%

regular matrix multiply dynamic matrix multiply

Page 4: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Problem IntroductionProblem IntroductionProblem IntroductionProblem Introduction

● What if we remove the spikes– The average performance improves– Execution time is predictable

● We can achieve our goal by: ● Only Software● Only Hardware● Both HW-SW

Page 5: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Related Work (Software)Related Work (Software)Related Work (Software)Related Work (Software)

● Data layout reorganization [Flajolet et al. 91]– Data are reorganized before & after computation

● Data copy [Granston et al. 93]– Data are moved in memory during computation

● Padding [Panda et al. 99]● Computation reorganization [Pingali et al.02]

– e.g., Tiling

Page 6: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Related Work (Hardware)Related Work (Hardware)Related Work (Hardware)Related Work (Hardware)

● Changing cache mapping – Using a different cache mapping functions [Gonzalez 97]– Increasing cache associativity [IA64]– Changing cache Size

● Bypassing caches– No interference: data are not stored in cache. [MIPS R5K]– HW-driven Pre-fetching

Page 7: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Related Work (HW-SW)Related Work (HW-SW)Related Work (HW-SW)Related Work (HW-SW)● Profiling

– Hardware adaptation [UCI]

– Software adaptation [Gatlin et al. 99]

● Pre-fetching [Jouppi et al.]– Latency hiding mostly, used also for cache interference

reduction

● Static Analysis [Ghosh et al. 99 - CME]– e.g., compiler driven data cache line adaptation [UCI]

Page 8: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Dynamic Mapping, (Software)Dynamic Mapping, (Software)Dynamic Mapping, (Software)Dynamic Mapping, (Software)

● We consider applications where memory references are affine functions only

● We associate a memory reference with a twin affine function

● We use the twin function as input address for the target data cache

● We use the original affine function to access memory

Page 9: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Example of twin functionExample of twin functionExample of twin functionExample of twin function● We consider the references A[i][j] and B[i][j]

– The affine functions are: ● A0+(iN+j)*4 ● B0+(iN+j)*4

● When there is interference (i.e.,|A0-B0| mod C < L where C and

L are cache and cache line size): – We use the twin functions

● A0+(iN+j)*4 ● B0+(iN+j)*4+L

Page 10: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Dynamic Mapping, (Hardware)Dynamic Mapping, (Hardware)Dynamic Mapping, (Hardware)Dynamic Mapping, (Hardware)● We introduce a new 3-address load instruction

– A register destination – Two register operands: the results of twin function

and of original affine function ● Note:

– the twin function result may be no real address – the original function is a real address

● (and goes though TLB – ACU)

Page 11: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Pseudo Assembly CodePseudo Assembly CodePseudo Assembly CodePseudo Assembly Code

ORIGINAL CODESet $R0, A_0Set $R1, B_0…Load $F0, $R0Load $F1, $R1Add $R0,$R0,4Add $R1,$R1,4

MODIFIED CODESet $R0, A_0

Set $R1, B_0

Add $R2, $R1, 32…

Load $F0, $R0

Load $F1, $R1, $R2Load $F1, $R1, $R2

Add $R2, $R2, 4

Add $R0, $R0, 4

Add $R1, $R1, 4

Page 12: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Experimental ResultsExperimental ResultsExperimental ResultsExperimental Results● We present experimental results obtained by

using combination of software approaches:– Padding– Data Copy – Without using any cycle-accurate simulator

● Matrix multiplication: – Simulation of cache performance a data cache size

16KB 1-way for optimally blocked algorithm

Page 13: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Matrix Multiply (simulation) Matrix Multiply (simulation) Matrix Multiply (simulation) Matrix Multiply (simulation)

Blocked Matrix Multiplication: 16KB 1-way data cache

0

2

4

6

8

10

12

14

16

18

Matrix Size

mis

s r

ate

%

regular matrix multiply dynamic matrix multiply

Page 14: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Experimental Results, cont.Experimental Results, cont.Experimental Results, cont.Experimental Results, cont.● n-FFT, Cooley-Tookey algorithm using balanced

decomposition in factors – The algorithm has been proposed first by Vitter et al– Complexity

● Best case O(n log log n) - Worst case O(n2)

– Normalized performance (MFLOPS)● We use the codelets from FFTW● For 128KB 4-way data cache

– Performance comparison with FFTW is in the paper

Page 15: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

FFT FFT 128KB 4-way data cache128KB 4-way data cache

FFT FFT 128KB 4-way data cache128KB 4-way data cache

SPARC64 100MHz

02468

10121416

1000

1960

4725

1036

8

2700

0

7560

0

1653

75

3628

80 128

256

51210

2420

4840

9681

92

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

2097

152

4194

304

8388

608

size

no

rmal

ized

MF

LO

PS

Upper Bound Dynamic Mapping Base Line

Page 16: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

Future workFuture workFuture workFuture work

● Dynamic Mapping is not fully automated– The code is hand made

● A clock-accurate processor simulator is missing– To estimate the effects of twin function

computations on performance and energy

● Application on a set of benchmarks

Page 17: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.
Page 18: A Data Cache with Dynamic Mapping P. D'Alberto, A. Nicolau and A. Veidenbaum ICS-UCI Speaker Paolo D’Alberto.

ConclusionsConclusionsConclusionsConclusions● The hardware is relatively simple

– Because it is the compiler (or user) that activates the twin computation

● and change the data cache mapping dynamically

● The approach aims to achieve a data cache mapping with:– zero interference,– no increase of cache hit latency – minimum extra hardware