
A Data Cache with Dynamic Mapping

P. D'Alberto, A. Nicolau and A. Veidenbaum

ICS-UCI

Speaker: Paolo D'Alberto

Problem Introduction

● Blocked algorithms perform well on average
  – because they exploit data temporal locality (see the sketch below)

● For some input sets, data cache interference nullifies the cache locality
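To make the blocking idea concrete, here is a minimal C sketch of a blocked (tiled) matrix multiply of the kind the talk refers to; the block size BS and the row-major layout are illustrative assumptions, not the authors' actual kernel.

    #include <stddef.h>

    #define BS 32  /* block size; assumed tuned so three BS x BS tiles fit in cache */

    /* Blocked multiply c += a * b over n x n row-major matrices.
     * Each BS x BS tile is reused BS times while cache-resident; this
     * temporal locality is what blocking buys.  For unlucky matrix
     * sizes, tiles of a and b map to the same cache sets and evict
     * each other -- the interference spikes shown next. */
    void matmul_blocked(size_t n, const double *a, const double *b, double *c)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    for (size_t i = ii; i < ii + BS && i < n; i++)
                        for (size_t k = kk; k < kk + BS && k < n; k++) {
                            double aik = a[i * n + k];
                            for (size_t j = jj; j < jj + BS && j < n; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }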

Problem Introduction, cont.

[Figure: Blocked matrix multiplication, 16KB 1-way data cache. X-axis: matrix size; y-axis: miss rate (%), 0 to 18. Series: regular matrix multiply vs. dynamic matrix multiply.]

Problem Introduction

● What if we remove the spikes?
  – The average performance improves
  – Execution time becomes predictable

● We can achieve this goal with:
  ● software only
  ● hardware only
  ● both HW and SW

Related Work (Software)

● Data layout reorganization [Flajolet et al. 91]
  – Data are reorganized before and after the computation

● Data copy [Granston et al. 93]
  – Data are moved in memory during the computation

● Padding [Panda et al. 99] (a sketch follows this list)

● Computation reorganization [Pingali et al. 02]
  – e.g., tiling
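As an aside, inter-array padding can be as simple as placing a one-line gap between arrays that would otherwise map to the same cache sets; a hedged sketch (the 32-byte line size and the single-allocation layout are assumptions):

    #include <stdlib.h>

    enum { LINE = 32 };  /* assumed cache-line size in bytes */

    /* Allocate two n*n arrays A and B with one cache line of padding
     * between them, so A[i][j] and B[i][j] land in different cache
     * sets even when the unpadded bases would have conflicted. */
    double *alloc_padded_pair(size_t n, double **b_out)
    {
        char *base = malloc(2 * n * n * sizeof(double) + LINE);
        if (!base) return NULL;
        *b_out = (double *)(base + n * n * sizeof(double) + LINE);
        return (double *)base;  /* A */
    }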

Related Work (Hardware)

● Changing the cache mapping
  – Using a different cache mapping function [Gonzalez 97]
  – Increasing cache associativity [IA64]
  – Changing the cache size

● Bypassing the cache
  – No interference: data are not stored in the cache [MIPS R5K]
  – HW-driven prefetching

Related Work (HW-SW)

● Profiling

  – Hardware adaptation [UCI]
  – Software adaptation [Gatlin et al. 99]

● Prefetching [Jouppi et al.]
  – Mostly for latency hiding; also used to reduce cache interference

● Static analysis [Ghosh et al. 99 - CME]
  – e.g., compiler-driven data cache line adaptation [UCI]

Dynamic Mapping (Software)

● We consider applications where memory references are affine functions only

● We associate each memory reference with a twin affine function

● We use the twin function's result as the address presented to the target data cache

● We use the original affine function to access memory

Example of a twin function

● We consider the references A[i][j] and B[i][j]

  – The affine functions are:
    ● A_0 + (i*N + j)*4
    ● B_0 + (i*N + j)*4

● When there is interference, i.e., |A_0 - B_0| mod C < L, where C and L are the cache and cache line sizes:
  – We use the twin functions (see the sketch below)
    ● A_0 + (i*N + j)*4
    ● B_0 + (i*N + j)*4 + L
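A minimal C sketch of this interference test and of the twin address computation, assuming the 16KB direct-mapped cache and 32-byte lines used elsewhere in the talk; the 4-byte element size follows the slide, the function names are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    enum { C_SIZE = 16 * 1024, L_SIZE = 32 };  /* assumed cache and line sizes */

    /* The slide's interference test: the two bases conflict when they
     * are less than one cache line apart modulo the cache size. */
    static int interferes(uintptr_t a0, uintptr_t b0)
    {
        uintptr_t d = (a0 > b0 ? a0 - b0 : b0 - a0) % C_SIZE;
        return d < L_SIZE;
    }

    /* Twin function for B[i][j]: the same affine form shifted by one
     * line, so it selects a different cache set.  Its result is used
     * only to index the cache; memory is still read at the original
     * address B_0 + (i*N + j)*4. */
    static uintptr_t twin_b(uintptr_t b0, size_t i, size_t j, size_t N)
    {
        return b0 + (i * N + j) * 4 + L_SIZE;
    }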

Dynamic Mapping (Hardware)

● We introduce a new 3-address load instruction

  – One destination register
  – Two register operands: the results of the twin function and of the original affine function

● Note:
  – the twin function's result may not be a real address
  – the original function's result is a real address
    ● (and goes through the TLB - ACU)

Pseudo Assembly Code

ORIGINAL CODE

    Set  $R0, A_0        # $R0 <- base address of A
    Set  $R1, B_0        # $R1 <- base address of B
    ...
    Load $F0, $R0        # load A element
    Load $F1, $R1        # load B element (may conflict with A in the cache)
    Add  $R0, $R0, 4     # advance A pointer
    Add  $R1, $R1, 4     # advance B pointer

MODIFIED CODE

    Set  $R0, A_0        # $R0 <- base address of A
    Set  $R1, B_0        # $R1 <- base address of B
    Add  $R2, $R1, 32    # $R2 <- twin address: B_0 shifted by one cache line (L = 32)
    ...
    Load $F0, $R0        # load A element
    Load $F1, $R1, $R2   # 3-address load: memory read at $R1, cache set chosen by $R2
    Add  $R2, $R2, 4     # advance twin pointer
    Add  $R0, $R0, 4     # advance A pointer
    Add  $R1, $R1, 4     # advance B pointer

Experimental Results

● We present experimental results obtained with a combination of software approaches:
  – Padding
  – Data copy (see the sketch after this list)
  – Without using any cycle-accurate simulator

● Matrix multiplication:
  – Simulation of cache performance for a 16KB 1-way data cache with an optimally blocked algorithm
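For reference, the data-copy part of this software emulation can be pictured as copying a conflicting tile into a contiguous scratch buffer before computing on it; a sketch under an assumed tile size TB, not the authors' actual code:

    #include <stddef.h>
    #include <string.h>

    #define TB 32  /* assumed tile edge */

    static double scratch[TB * TB];  /* contiguous buffer; cannot conflict with itself */

    /* Copy a TB x TB tile of the n x n row-major matrix m, whose
     * top-left corner is (i0, j0), into the scratch buffer; the kernel
     * then runs on scratch instead of the conflicting addresses. */
    void copy_tile_in(const double *m, size_t n, size_t i0, size_t j0)
    {
        for (size_t i = 0; i < TB; i++)
            memcpy(&scratch[i * TB], &m[(i0 + i) * n + j0], TB * sizeof(double));
    }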

Matrix Multiply (simulation)

[Figure (repeated from above): Blocked matrix multiplication, 16KB 1-way data cache. X-axis: matrix size; y-axis: miss rate (%), 0 to 18. Series: regular matrix multiply vs. dynamic matrix multiply.]

Experimental Results, cont.

● n-point FFT: Cooley-Tukey algorithm using a balanced decomposition into factors (see the sketch after this list)
  – The algorithm was first proposed by Vitter et al.
  – Complexity
    ● Best case O(n log log n) - worst case O(n^2)
  – Normalized performance (MFLOPS)

● We use the codelets from FFTW

● For a 128KB 4-way data cache
  – A performance comparison with FFTW is in the paper
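The balanced decomposition can be sketched as picking one factor as close to sqrt(n) as possible; a hedged illustration (real code would recurse on both factors and call the FFTW codelets at the leaves):

    #include <stddef.h>

    /* Return the largest divisor of n that is <= sqrt(n); the balanced
     * split is then n = n1 * (n / n1).  For prime n this degenerates
     * to n1 = 1, which is the O(n^2) worst case mentioned above. */
    static size_t balanced_factor(size_t n)
    {
        size_t n1 = 1;
        for (size_t f = 2; f * f <= n; f++)
            if (n % f == 0)
                n1 = f;
        return n1;
    }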

FFT, 128KB 4-way data cache

[Figure: normalized MFLOPS vs. transform size on a SPARC64 at 100 MHz, for power-of-two sizes from 128 to 8388608 and non-power-of-two sizes from 1000 to 362880. Series: Upper Bound, Dynamic Mapping, Base Line.]

Future work

● Dynamic Mapping is not fully automated
  – The code is written by hand

● A cycle-accurate processor simulator is missing
  – Needed to estimate the effects of twin function computations on performance and energy

● Application to a set of benchmarks

Conclusions

● The hardware is relatively simple

  – Because it is the compiler (or the user) that activates the twin computation
    ● and changes the data cache mapping dynamically

● The approach aims to achieve a data cache mapping with:
  – zero interference,
  – no increase in cache hit latency,
  – minimal extra hardware