1

Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines

Moinuddin K. Qureshi, M. Aater Suleman, Yale N. Patt

HPCA 2007

2

Introduction

Caches are organized at linesize granularity

Helps when spatial locality is high

Unused words when spatial locality is low

Unused words occupy space without contributing to cache hits

Filtering unused words allows cache to store more cache lines

3

Problem: Not all words are useful

On average, less than 60% of words are used (4.7/8)

Cache line (64B) divided into 8 words of 8B each (1 MB 8-way L2 cache)

[Figure: Words used per line (avg)]

4

Goal: Improving cache performance

Smaller linesize can result in fewer unused words

Smaller linesize degrades cache performance

Linesize of 32B increases MPKI for 14 of 16 benchmarks

Average MPKI increases by 25%

Insight: Word usage stabilizes as a line traverses from MRU to LRU

Goal: Improving cache performance by filtering unused words

5

Insight

Footprint = 8-bits per line that tracks word usage

Most footprint updates occur early in the recency stack

[Figure: Max recency position before footprint update, over the recency stack (MRU, Pos 1, Pos 2, Pos 3, Pos 4, Pos 5, Pos 6, LRU); plotted values: 78%, 5%, 6%, 11%]

Line Distillation (LDIS): Evict unused words when a line crosses a certain recency
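
The footprint idea can be illustrated with a minimal sketch. It assumes the 64B line with eight 8B words from the earlier slide; the function names (`touch`, `used_words`, `distill`) are hypothetical and not taken from the talk.

```python
from __future__ import annotations

WORDS_PER_LINE = 8   # 64B line divided into 8 words of 8B each
WORD_BYTES = 8
LINE_BYTES = 64


def touch(footprint: int, byte_offset: int) -> int:
    """Set the footprint bit of the word that contains byte_offset."""
    word = (byte_offset % LINE_BYTES) // WORD_BYTES
    return footprint | (1 << word)


def used_words(footprint: int) -> list[int]:
    """Indices of the words whose footprint bit is set."""
    return [w for w in range(WORDS_PER_LINE) if footprint & (1 << w)]


def distill(footprint: int) -> tuple[list[int], list[int]]:
    """Split a line's words into (kept, evicted) according to its footprint."""
    kept = used_words(footprint)
    evicted = [w for w in range(WORDS_PER_LINE) if w not in kept]
    return kept, evicted


fp = 0
fp = touch(fp, 0)    # access that falls in word 0
fp = touch(fp, 60)   # access that falls in word 7
kept, evicted = distill(fp)
```

Here the 8-bit footprint ends with only bits 0 and 7 set, so distillation would keep two words and free the other six.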

6

Outline

Background

Line Distillation

Experimental Evaluation

Interaction with Compression

Related Work and Summary

7

Framework for LDIS

[Figure: PROCESSOR with ICACHE and DCACHE (labels: footprint, valid bits, sectored) above the L2 Distill Cache, which is split into LOC and WOC; line from memory]

LOC = Line Organized Cache

WOC = Word Organized Cache

8

Distill Cache (Operation)

[Figure: one set of a traditional cache (4-way) and of a distill cache (LOC + WOC), ordered from MRU to LRU, holding lines A, B, and C]

Four cases:

1. Cache Miss: Access to line D. Install line D in LOC and update LRU state. Line A leaves LOC and is distilled: words used? A0 and A7 were used, so install A0, A7 in WOC and evict A[1:6].

2. LOC Hit: Access to line B. Same as traditional cache.

3. WOC Hit: Access to line A (word A0). Send A0 and A7 to L1 along with the valid bits.

4. Hole Miss: Access to line A (word A1). Invalidate all words of A in WOC. Fetch A from memory and install in LOC.
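
The four cases can be sketched as a small classifier. This is an illustration only: the data layout (a set of LOC line names and a dictionary mapping WOC line names to their resident word indices) and the function name `classify` are assumptions, not the hardware organization from the talk.

```python
from __future__ import annotations


def classify(loc: set[str], woc: dict[str, set[int]], line: str, word: int) -> str:
    """Classify an access to (line, word) against one distill-cache set."""
    if line in loc:
        return "LOC hit"            # whole line is present
    if line in woc:
        if word in woc[line]:
            return "WOC hit"        # the requested word was kept
        return "hole miss"          # line is present, but this word was filtered
    return "cache miss"             # line is in neither structure


# Hypothetical snapshot: B and C are in LOC, words A0 and A7 are in WOC.
loc = {"B", "C"}
woc = {"A": {0, 7}}

outcomes = [
    classify(loc, woc, "D", 0),
    classify(loc, woc, "B", 3),
    classify(loc, woc, "A", 0),
    classify(loc, woc, "A", 1),
]
```

The hole miss is the case a traditional cache never sees: the tag matches, but the requested word was discarded as unused.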

9

Median Threshold Filtering

A line with many used words can evict several lines from WOC

Example: WOC holds A0 B0 C0 D0 E0 F0 G0 H0 (one word from each of 8 lines)

Line X has all 8 words used: installing X0 X1 X2 X3 X4 X5 X6 X7 evicts 8 lines from WOC

Increase lines in WOC by not installing lines for which used words > threshold “K”

K = median words used in LOC line (computed at runtime)
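
A minimal sketch of the filtering rule follows. The slide does not say how the median is computed in hardware, so the sketch simply assumes a list of observed used-word counts; the function names and the sample counts are hypothetical.

```python
from statistics import median


def median_threshold(used_word_counts):
    """Threshold K: the median number of used words among observed LOC lines."""
    return median(used_word_counts)


def should_install_in_woc(num_used_words, threshold_k):
    """Skip lines whose used-word count exceeds K; install the rest."""
    return num_used_words <= threshold_k


# Hypothetical used-word counts for lines leaving LOC.
counts = [1, 2, 2, 3, 8, 8, 1]
k = median_threshold(counts)

install_sparse = should_install_in_woc(2, k)   # line with 2 used words
install_dense = should_install_in_woc(8, k)    # line with all 8 words used
```

With these counts, a line like X (all 8 words used) is kept out of WOC, so it cannot displace eight single-word entries.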

11

Methodology

Configuration:

L2 cache: 1MB 8-way 64B linesize

(Distill cache gives 6 ways to LOC and 2 ways to WOC)

Out-of-order processor with 16KB 2-way L1s

400 cycle memory

Benchmarks:

15 SPEC2K benchmarks + health from the Olden suite

(A 250M instruction slice using SimPoint for SPEC2K)
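
The configuration above implies the following per-set geometry. This is plain arithmetic on the stated parameters; the last figure (maximum lines per set) is a derived upper bound that assumes every WOC entry holds a single word, which the slide does not state.

```python
CACHE_BYTES = 1 << 20      # 1MB L2
WAYS = 8
LINE_BYTES = 64
WORD_BYTES = 8

sets = CACHE_BYTES // (WAYS * LINE_BYTES)
words_per_line = LINE_BYTES // WORD_BYTES

loc_ways, woc_ways = 6, 2  # distill cache split from the configuration above
woc_words_per_set = woc_ways * words_per_line

# Upper bound on distinct lines per set, assuming one word per WOC entry.
max_lines_per_set = loc_ways + woc_words_per_set
```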

12

Results

LDIS (MT) reduces MPKI by 25%

[Figure: Reduction in L2 MPKI (%) for LDIS (No MT) and LDIS (with MT)]

13

Reverter Circuit (RC)

Tournament selection: Distill cache vs. traditional cache

Dynamic set sampling with 32 sets [Qureshi+ ISCA’06]

[Figure: an ATD-LRU and the distill cache, each shown with sets A through H; sets B, E, and G are highlighted and feed the counter SCTR (- / +)]

For sets A, C, D, F, H:

if (SCTR > 75%) Enable LDIS

if (SCTR < 25%) Disable LDIS

(storage overhead of ATD: 1KB)
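
The enable/disable rule can be sketched as a counter with hysteresis. Only the 75% and 25% thresholds come from the slide. The counter width, the update direction (counting up when the distill cache avoids a miss that the LRU directory incurs), and the class name `Reverter` are assumptions made for illustration.

```python
class Reverter:
    """Sketch of a tournament counter that switches LDIS on or off."""

    def __init__(self, bits=10):            # counter width is hypothetical
        self.max = (1 << bits) - 1
        self.sctr = self.max // 2            # start in the middle
        self.ldis_enabled = True

    def record(self, atd_lru_miss, distill_miss):
        """Update SCTR from one access to a sampled set."""
        # Assumed convention: up when LDIS wins, down when LRU wins.
        if atd_lru_miss and not distill_miss:
            self.sctr = min(self.max, self.sctr + 1)
        elif distill_miss and not atd_lru_miss:
            self.sctr = max(0, self.sctr - 1)
        self._apply_thresholds()

    def _apply_thresholds(self):
        if self.sctr > 0.75 * self.max:
            self.ldis_enabled = True
        elif self.sctr < 0.25 * self.max:
            self.ldis_enabled = False
        # Between the thresholds the previous decision is kept.


rc = Reverter(bits=4)                        # small counter for the example
for _ in range(4):
    rc.record(atd_lru_miss=False, distill_miss=True)
disabled_after_losses = not rc.ldis_enabled
```

The gap between the two thresholds keeps the decision from flipping on every access when the two policies perform about equally.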

14

Results with RC

RC disables LDIS when it increases MPKI.

LDIS (MT,RC) reduces MPKI by 30%

[Figure: Reduction in L2 MPKI (%) for LDIS (MT, No RC) and LDIS (MT,RC)]

15

Overheads

Storage: Tags for WOC + footprint bits: 12.2% overhead

Latency: Tag-access (LOC+WOC) increases by one cycle; WOC hits incur two cycles to rearrange words

Power: Additional power of WOC tag-store

16

IPC Results

LDIS improves average IPC by 12%

[Figure: IPC Improvement (%)]

18

Compression vs. LDIS

Several proposals to increase capacity via compression

Compression and LDIS are fundamentally different

Compression exploits redundancy in stored data

LDIS leverages unused words for spare capacity

Footprint Aware Compression (FAC) combines both

FAC compresses used words before installing in WOC

19

Results for FAC

Compression and LDIS interact positively.

FAC reduces MPKI by 50%

[Figure: Reduction in L2 MPKI (%), axis from 0 to 50, for LDIS, Compression, and FAC]

21

Related work

Spatial-Temporal Cache - Gonzales+ [ICS’95]

Spatial Locality Prediction - Johnson+ [ISCA’97]

Variable Linesize Cache - Veidenbaum+ [ICS’99]

Spatial Footprint Prediction - Kumar+ [ISCA’98], Pujara+ [HPCA’06]

Spatial Pattern Prediction - Chen+ [HPCA’05]

LDIS is particularly suited for large caches and outperforms predictor-based techniques without requiring a separate structure for tracking spatial footprint

22

Contributions

Line Distillation: Filter unused words without a separate footprint predictor

Distill cache: Utilize extra capacity created by LDIS

Median Threshold Filtering and Reverter Circuit: Improve performance and robustness of LDIS

Result: LDIS (MT+RC) reduces MPKI by 30%

Footprint Aware Compression: LDIS + compression

Result: FAC reduces MPKI by 50%

23

Questions

24

Result comparing capacity

25

Line Size vs. MPKI

26

Distribution of Hit-Miss

27

Average words usage (detailed)

28

Result for 3 types of LDIS

29

Replacement

LRU in LOC

WOC needs variable sized replacement

Only power-of-two sizes allowed in WOC

Placement constrained to alignment boundary

Random selection in case of multiple candidates
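
The three WOC constraints above can be sketched as follows. The sketch assumes each WOC way holds 8 words and that there are 2 WOC ways per set (from the methodology slide), and it treats every aligned slot as a replacement candidate. The function names are hypothetical, and the real candidate selection may be narrower.

```python
import random

WOC_WAY_WORDS = 8   # assumed: one WOC way holds 8 words


def allocation_size(num_used_words):
    """Round the number of used words up to a power of two (1, 2, 4, or 8)."""
    size = 1
    while size < num_used_words:
        size *= 2
    return size


def candidate_slots(size, ways=2):
    """All (way, start_word) positions aligned to a multiple of size."""
    return [(way, start)
            for way in range(ways)
            for start in range(0, WOC_WAY_WORDS, size)]


def choose_slot(size, ways=2, rng=random):
    """Pick one aligned slot at random among the candidates."""
    return rng.choice(candidate_slots(size, ways))


size = allocation_size(3)          # 3 used words need a 4-word slot
slots = candidate_slots(size)      # aligned 4-word slots across 2 ways
slot = choose_slot(size, rng=random.Random(0))
```

Restricting sizes to powers of two and starts to aligned boundaries keeps the number of candidate positions small, at the cost of some internal padding (a line with 3 used words occupies a 4-word slot).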

30

Background (pictorial)

31

Result LDIS vs. FAC (detailed)

32

Comparison with SFP

33

Appendix A: Other SPEC Benchmarks

34

Appendix B: Cache Size vs. Density

35

Summary

Many words in cache lines remain unused

Unused words are unlikely to be accessed in the less recent part of the LRU stack, which motivates Line Distillation (LDIS)

Distill-cache utilizes extra capacity created by LDIS

LDIS reduces MPKI by 30% and improves IPC by 12%

“Footprint Aware Compression” combines LDIS and compression to reduce MPKI by 50%