
Increasing Cache Efficiency by Eliminating Noise

Prateek Pujara & Aneesh Aggarwal {prateek, aneesh}@binghamton.edu

http://caps.cs.binghamton.edu
State University of New York, Binghamton

INTRODUCTION

Caches are essential for bridging the processor-memory performance gap, so they should be utilized as efficiently as possible.

Fetch only the useful data

Cache Utilization: the percentage of useful words out of the total words fetched into the cache.
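As a quick illustration of this definition (the numbers are invented, not from the talk), utilization can be computed as:

```python
# Tiny illustration of the cache-utilization definition: e.g. a 32-byte
# block fetched as 8 four-byte words, of which only 3 are ever used.

def cache_utilization(useful_words, total_words_fetched):
    """Percentage of useful words out of all words fetched into the cache."""
    return 100.0 * useful_words / total_words_fetched

print(cache_utilization(useful_words=3, total_words_fetched=8))  # -> 37.5
```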

Utilization vs. Block Size

Larger cache blocks:
• Increase bandwidth requirement
• Reduce utilization

Smaller cache blocks:
• Reduce bandwidth requirement
• Increase utilization

[Figure: Percent cache utilization (0-100%) for a 16KB, 4-way set-associative cache with 32-byte blocks, across the SPEC benchmarks (Bzip2, Gcc, Mcf, Parser, Vortex, Vpr, Ammp, Applu, Apsi, Art, Equake, Mgrid, Swim, Wupwise) with Int, FP, and overall averages.]

Methods to Improve Utilization

• Rearrange data/code
• Dynamically adapt cache line size
• Sub-blocking

Benefits of Utilization Improvement

• Lower energy consumption: avoid wasting energy on useless words.
• Improved performance: make better use of the available cache space.
• Reduced memory traffic: do not fetch useless words.

Our Goal

• Improve utilization
• Predict the to-be-referenced words
• Avoid cache pollution by fetching only the predicted words

Our Contributions

• Illustrate the high predictability of cache noise
• Propose efficient cache noise predictors
• Show the potential benefits of cache noise prediction based fetching in terms of:
  • Cache utilization
  • Cache power consumption
  • Bandwidth requirement
• Illustrate the benefits of cache noise prediction for prefetching
• Investigate cache noise prediction as an alternative to sub-blocking

Cache Noise Prediction

Programs repeat their patterns of memory references, so cache noise can be predicted from the history of words accessed in cache blocks.

Cache Noise Predictors
1) Phase Context Predictor (PCP): records the word usage history of the most recently evicted cache block.
2) Memory Context Predictor (MCP): assumes that data accessed from contiguous memory locations will be accessed in the same fashion.
3) Code Context Predictor (CCP): assumes that instructions in a particular portion of the code will access data in the same fashion.

Cache Noise Predictors

For the code context predictor:
• Use the higher-order bits of the PC as the context
• Store the context along with the cache block
• Add two bit vectors to each cache block:
  • One identifying the valid words present
  • One storing the access pattern
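A minimal sketch of this per-block bookkeeping, assuming 4-word blocks and illustrative field names (the actual hardware layout in the paper differs):

```python
# Per-cache-block metadata as described on the slide (names assumed):
# the fetch context (higher-order PC bits) plus two bit vectors --
# one marking which words are valid (were fetched), one marking which
# words have actually been accessed.

from dataclasses import dataclass, field

@dataclass
class BlockMetadata:
    context: int                                             # high PC bits of the missing access
    valid: list = field(default_factory=lambda: [0] * 4)     # words present in the block
    accessed: list = field(default_factory=lambda: [0] * 4)  # words actually used

    def touch(self, word_offset):
        # A demand access can only hit a word that was fetched.
        assert self.valid[word_offset], "access to a word that was not fetched"
        self.accessed[word_offset] = 1

meta = BlockMetadata(context=0b100110, valid=[1, 1, 0, 0])
meta.touch(0)
print(meta.accessed)  # -> [1, 0, 0, 0]
```

On eviction, the `accessed` vector becomes the block's word usage history and is recorded in the predictor table under the block's stored context.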

Code Context Predictor (CCP): Worked Example

Say the PC of an instruction is 1001100100. Its code context is its higher-order bits: 100110 (context X).

Each predictor entry holds a context, a valid bit, and the last word usage history. Initial predictor table (4-word blocks):

Context       Valid-Bit   Last Word Usage History
X (100110)    1           1 1 0 0
Y (101001)    1           1 0 0 1
Z (xxxxxx)    0           x x x x

Step 1: A miss due to PC 1001100100 matches context X, so only the 1st and 2nd words are brought in (history 1100). The evicted cache block was brought in by PC 101110 and used only its 1st word, so its context and history fill the invalid entry:

Context       Valid-Bit   Last Word Usage History
X (100110)    1           1 1 0 0
Y (101001)    1           1 0 0 1
Z (101110)    1           1 0 0 0

Step 2: A miss due to PC 1011101100 matches context Z, so only the 1st word is brought in (history 1000). The evicted block was brought in by PC 101001 (context Y) and used its 2nd and 4th words, so Y's history is updated:

Context       Valid-Bit   Last Word Usage History
X (100110)    1           1 1 0 0
Y (101001)    1           0 1 0 1
Z (101110)    1           1 0 0 0
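The worked example above can be replayed with a minimal, illustrative predictor model (a plain dictionary standing in for the hardware table; the PC width and class/method names are assumptions, not the paper's implementation):

```python
# Minimal sketch of the CCP worked example. Contexts are the high-order
# PC bits; each predictor entry maps a context to the word-usage bit
# vector of the last evicted block fetched under that context.

class CodeContextPredictor:
    def __init__(self):
        self.table = {}  # context -> last word usage history (tuple of bits)

    @staticmethod
    def context(pc, context_bits=6, pc_bits=10):
        """Higher-order bits of the PC form the code context."""
        return pc >> (pc_bits - context_bits)

    def predict(self, pc):
        """On a miss: which words to fetch (None -> no prediction)."""
        return self.table.get(self.context(pc))

    def update(self, miss_pc, usage):
        """On eviction: record the evicted block's usage history,
        indexed by the context of the PC that brought it in."""
        self.table[self.context(miss_pc)] = tuple(usage)

ccp = CodeContextPredictor()
# Initial table state from the example (X and Y already recorded).
ccp.update(0b1001100100, (1, 1, 0, 0))  # context X = 100110
ccp.update(0b1010010000, (1, 0, 0, 1))  # context Y = 101001

# Step 1: miss at PC 1001100100 predicts via X; the evicted block was
# brought in by a context-Z PC and used only its 1st word.
assert ccp.predict(0b1001100100) == (1, 1, 0, 0)
ccp.update(0b1011100000, (1, 0, 0, 0))  # context Z = 101110

# Step 2: miss at PC 1011101100 predicts via Z; the evicted block was
# brought in by a context-Y PC and used its 2nd and 4th words.
assert ccp.predict(0b1011101100) == (1, 0, 0, 0)
ccp.update(0b1010011111, (0, 1, 0, 1))  # Y's history is overwritten
```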

Predictability of CCP

Predictability = correct predictions / total misses (the no-prediction rate is almost 0%).
• PCP: 56%
• MCP: 67%

[Figure: CCP predictability (percentage) across the SPEC benchmarks and Int/FP/overall averages, for CCP with 30-bit, 28-bit, and 26-bit contexts.]

Improving the Predictability

• Miss Initiator Based History (MIBH): keep word usage histories indexed by the offset of the word that initiated the miss.
• ORing Previous Two Histories (OPTH): bitwise-OR the past two histories.
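OPTH can be sketched in a couple of lines: the prediction is the union of the last two usage patterns, trading a little over-fetch for fewer under-predictions (the function name is illustrative):

```python
# Sketch of OPTH (ORing Previous Two Histories): predict the union of
# the last two word-usage bit vectors for a context.

def opth_predict(prev_history, curr_history):
    """Predict words to fetch as the bitwise OR of the last two histories."""
    return tuple(a | b for a, b in zip(prev_history, curr_history))

# The block used words {1st, 4th} two fetches ago and {1st, 2nd} last time:
print(opth_predict((1, 0, 0, 1), (1, 1, 0, 0)))  # -> (1, 1, 0, 1)
```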

Predictability of CCP with MIBH and OPTH

Using both MIBH and OPTH, the predictability of PCP and MCP rose to about 68% and 75% respectively.

[Figure: Predictability (percentage) across the SPEC benchmarks and Int/FP/overall averages for CCP(30 bits) – MIBH and CCP(28 bits) – MIBH.]

CCP Implementation

Each predictor table entry holds a context, a valid bit, a Miss Initiator Word Offset (MIWO), and a words usage history, and is accessed through a read/write port. On a lookup, the tag (context) is broadcast to all entries and compared (==) against each stored context; a matching valid entry supplies the predicted usage history for that MIWO.

MIWO -- Miss Initiator Word Offset

Experimental Setup

• Noise prediction applied to the L1 data cache
• L1 D-cache: 16KB, 4-way set associative, 32-byte blocks
• Unified L2 cache: 512KB, 8-way set associative, 64-byte blocks
• L1 I-cache: 16KB, direct mapped
• ROB: 256 instructions
• LSQ (load/store queue): 64 entries
• Issue queue: 96 Int / 64 FP

Prediction Accuracies with 32/4, 16/8 & 16/4 CCP

[Figure: Correct prediction, misprediction, and no-prediction percentages (Int, FP, and overall averages) for the 32/4, 16/8, and 16/4 CCP configurations.]

RESULTS

[Figures: BASE vs. CCP comparisons of bandwidth, cache utilization, IPC, and miss rate.]

Percentage Dynamic Energy Savings

[Figure: Percentage dynamic energy savings across the SPEC benchmarks and Int/FP/overall averages.]

Prefetching

Processors employ prefetching to improve the cache miss rate.

• Fetch the next cache block on a miss to exploit spatial locality.

The prefetched cache block is predicted to have the same pattern as that of the currently fetched block.

Prefetching

A prefetched cache block is stored without any context information; when it is accessed for the first time, the context and the offset information are recorded. Two update policies are compared:
• Update: the prefetched cache block updates the predictor table when evicted.
• No Update: the prefetched block does not update the predictor table when evicted.
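The two policies above can be sketched as follows (class and function names are illustrative, not from the paper):

```python
# Sketch of prefetched-block handling: a prefetched block arrives with
# no context; its first demand access records the accessing context and
# word offset. Under the "update" policy its usage history feeds the
# predictor on eviction; under "no_update" it is dropped silently.

class PrefetchedBlock:
    def __init__(self, n_words=4):
        self.context = None        # unknown until first demand access
        self.miwo = None           # miss-initiator word offset
        self.usage = [0] * n_words

    def access(self, pc_context, word_offset):
        if self.context is None:   # first access: record context + offset
            self.context = pc_context
            self.miwo = word_offset
        self.usage[word_offset] = 1

def on_evict(block, predictor_table, policy="update"):
    # Only blocks that were actually touched, and only under the
    # "update" policy, contribute a history entry.
    if policy == "update" and block.context is not None:
        predictor_table[(block.context, block.miwo)] = tuple(block.usage)

table = {}
b = PrefetchedBlock()
b.access(pc_context=0b101110, word_offset=0)
b.access(pc_context=0b101110, word_offset=2)
on_evict(b, table, policy="update")
print(table)  # -> {(46, 0): (1, 0, 1, 0)}
```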

Prediction Accuracy with Prefetching

• Energy consumption reduced by about 22%
• Utilization increased by about 70%
• Miss rate increased by only about 2%

[Figure: Correct prediction, misprediction, and no-prediction percentages (Int, FP, and overall averages) for the No Prefetching, No Update, and Update configurations.]

Sub-blocking

Sub-blocking is used to:
• Reduce cache noise
• Reduce bandwidth requirement

Limitation of sub-blocking:
• Increased miss rate

Can we use cache noise prediction as an alternative to sub-blocking?

Cache Noise Prediction vs. Sub-blocking

[Figures: Sub-block vs. CCP comparisons of miss rate, utilization, energy savings, and bandwidth.]

Conclusion

• Cache noise is highly predictable.
• Proposed cache noise predictors:
  • CCP achieves a 75% prediction rate, with 97% of predictions correct, using a small 16-entry table.
• Prediction has no impact on IPC and minimal impact (0.1%) on miss rate.
• Very effective with prefetching.
• Compared to sub-blocking, cache noise prediction based fetching improves miss rate by 97% and utilization by 10%.

QUESTIONS ???

Prateek Pujara & Aneesh Aggarwal {prateek, aneesh}@binghamton.edu

http://caps.cs.binghamton.edu
State University of New York, Binghamton