Transcript of "High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)", Aamer Jaleel, Kevin Theobald, Simon Steely Jr., Joel Emer, Intel Corporation, VSSAD, International Symposium on Computer Architecture (ISCA 2010)

Page 1:

High Performance Cache Replacement Using

Re-Reference Interval Prediction (RRIP)

Aamer Jaleel, Kevin Theobald,

Simon Steely Jr., Joel Emer

Intel Corporation, VSSAD

International Symposium on Computer Architecture (ISCA 2010)

Page 2:

Motivation

• Factors making caching important:
  • Increasing ratio of CPU speed to memory speed
  • Multi-core poses challenges for shared cache management

• LRU has been the standard replacement policy at the LLC
• However, LRU has problems!


Page 3:

Problems with LRU Replacement


Working set larger than the cache causes thrashing

References to non-temporal data (scans) discard the frequently referenced working set

[Diagram: a working set larger than the cache (Wsize > LLCsize) misses on every reference; a working set that fits hits repeatedly until scan references evict it, turning later re-references into misses]

Our studies show that scans occur frequently in many commercial workloads

Page 4:


Desired Behavior from Cache Replacement


• Working set larger than the cache: preserve some of the working set in the cache

• Recurring scans: preserve the frequently referenced working set in the cache


Page 5:

Prior Solutions to Enhance Cache Replacement

• Working set larger than the cache: preserve some of the working set in the cache
  • Dynamic Insertion Policy (DIP): thrash-resistance with minimal changes to HW

• Recurring scans: preserve the frequently referenced working set in the cache
  • Least Frequently Used (LFU) addresses scans
  • However, LFU adds complexity and performs poorly for recency-friendly workloads

GOAL: Design a High-Performing, Scan-Resistant Policy that Requires Minimal Changes to HW

Page 6:

Belady’s Optimal (OPT) Replacement Policy

• Replacement decisions using perfect knowledge of future reference order

• Victim Selection Policy: replace the block that will be re-referenced furthest in the future

Physical Way #:            0    1    2    3    4    5    6    7
Cache Tag:                 a    c    b    h    f    d    g    e
"Time" of next reference:  4   13   11    5    3    6    9    1

Victim block: c (way 1), the block referenced furthest in the future
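As an illustration of the victim-selection rule above, here is a minimal Python sketch (not from the paper) that picks the OPT victim, assuming an oracle supplies each resident block's next-reference "time":

```python
# Toy sketch of Belady's OPT victim selection (illustrative only).
# Assumes an oracle gives, for each resident block, the time of its next reference.

def opt_victim(next_use_time):
    """next_use_time: dict mapping cache tag -> time of next reference
    (use float('inf') for blocks never referenced again).
    Returns the tag re-referenced furthest in the future."""
    return max(next_use_time, key=next_use_time.get)

# Example matching the slide: ways 0..7 hold a, c, b, h, f, d, g, e
next_use = {'a': 4, 'c': 13, 'b': 11, 'h': 5, 'f': 3, 'd': 6, 'g': 9, 'e': 1}
print(opt_victim(next_use))   # -> 'c' (next referenced at time 13, furthest away)
```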

Page 7:

Practical Cache Replacement Policies

• Replacement decisions made by predicting the future reference order

• Victim Selection Policy: replace the block predicted to be re-referenced furthest in the future

• Continually update predictions of the future reference order
  • Natural update opportunities are on cache fills and cache hits

Physical Way #:                      0    1    2    3    4    5    6    7
Cache Tag:                           a    c    b    h    f    d    g    e
"Predicted time" of next reference: ~4  ~13  ~11   ~5   ~3   ~6   ~9   ~1

Victim block: the block with the furthest predicted re-reference time

Page 8:

LRU Replacement in Prediction Framework

• The "LRU chain" maintains the re-reference prediction
  • Head of chain (i.e., MRU position): predicted to be re-referenced soon
  • Tail of chain (i.e., LRU position): predicted to be re-referenced far in the future

• LRU predicts that blocks are re-referenced in the reverse order of reference

• Rename the "LRU chain" to the "Re-Reference Prediction (RRP) chain"
  • Rename "MRU position" to RRP head and "LRU position" to RRP tail

LRU chain position (stored with each cache block):  0  1  2  3  4  5  6  7
Block:                                               h  g  f  e  d  c  b  a
MRU position = RRP head;  LRU position = RRP tail

Page 9:

Practicality of Chain Based Replacement

• Problem: chain-based replacement is too expensive!
  • log2(associativity) bits required per cache block (16-way requires 4 bits/block)

• Solution: LRU chain positions can be quantized into different buckets
  • Each bucket corresponds to a predicted re-reference interval
  • The value of the bucket is called the Re-Reference Prediction Value (RRPV)

• Hardware cost: 'n' bits per block [ideally you would like n < log2(associativity)]

Qualitative prediction for each RRPV (n = 2):
  RRPV 0: 'near-immediate'
  RRPV 1: 'intermediate'
  RRPV 2: 'far'
  RRPV 3: 'distant'

RRP chain (RRP head ... RRP tail):  h  g  f  e  d  c  b  a

Page 10:

Representation of Quantized Replacement (n = 2)

Physical Way #:  0    1    2    3    4    5    6    7
Cache Tag:       a    c    b    h    f    d    g    e
RRPV:            3    2    3    0    1    1    0    1

Qualitative prediction: RRPV 0 = 'near-immediate', 1 = 'intermediate', 2 = 'far', 3 = 'distant'

Page 11:

Emulating LRU with Quantized Buckets (n=2)

• Victim Selection Policy: evict a block with the distant RRPV (i.e., 2^n - 1 = '3')
  • If no distant RRPV (i.e., '3') is found, increment all RRPVs and repeat the search
  • If multiple are found, a tie-breaker is needed; here, always start the search from physical way '0'

• Insertion Policy: insert the new block with RRPV = '0'

• Update Policy: a cache hit updates the block's RRPV to '0'

(A sketch of these three rules follows below.)
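A minimal Python sketch of the three rules above for a single set; tags, RRPV values, and helper names are illustrative, not the paper's code:

```python
# Quantized (RRPV-based) replacement emulating LRU: insert at RRPV = 0,
# reset to RRPV = 0 on a hit, and evict a 'distant' block.

DISTANT = 3          # 2^n - 1 with n = 2

def find_victim(rrpv):
    """Return the way of a block with the distant RRPV, incrementing all RRPVs
    (saturating at DISTANT) until one is found; ties go to the lowest way #."""
    while True:
        for way, v in enumerate(rrpv):
            if v == DISTANT:
                return way
        for way in range(len(rrpv)):
            rrpv[way] = min(rrpv[way] + 1, DISTANT)

def on_fill(rrpv, way):      # insertion policy: new block predicted near-immediate
    rrpv[way] = 0

def on_hit(rrpv, way):       # update policy: hits predicted near-immediate
    rrpv[way] = 0

# Example from the slide: RRPVs for ways 0..7 holding a, c, b, h, f, d, g, e
rrpv = [3, 2, 3, 0, 1, 1, 0, 1]
victim = find_victim(rrpv)   # way 0 (block 'a'), the first block with RRPV == 3
on_fill(rrpv, victim)        # new block 's' inserted with RRPV = 0
```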

Physical Way #:  0    1    2    3    4    5    6    7
Cache Tag:       a    c    b    h    f    d    g    e
RRPV:            3    2    3    0    1    1    0    1

Example: way 0 (RRPV = 3) is the victim; new block 's' is inserted there with RRPV = 0, and a later hit on 's' keeps its RRPV at '0'.

But We Want to do BETTER than LRU!!!

Page 12:

Re-Reference Interval Prediction (RRIP)

• The framework enables re-reference predictions to be tuned at insertion/update
  • Unlike LRU, can use a non-zero RRPV on insertion
  • Unlike LRU, can use a non-zero RRPV on cache hits

• Static Re-Reference Interval Prediction (SRRIP)
  • Determine the best insertion/update prediction using profiling [and apply it to all apps]

• Dynamic Re-Reference Interval Prediction (DRRIP)
  • Dynamically determine the best re-reference prediction at insertion


Page 13:

Static RRIP Insertion Policy – Learn Block’s Re-reference Interval

• Key Idea: do not give new blocks too much (or too little) time in the cache
  • Predict that the new cache block will not be re-referenced soon
  • Insert the new block with some RRPV other than '0' (see the sketch below)
  • Similar to inserting in the "middle" of the RRP chain
    • However, it is NOT identical to a fixed insertion position on the RRP chain (see paper)
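The sketch below shows how the SRRIP insertion rule differs from the LRU-emulation sketch earlier: only the RRPV assigned on a fill changes. The 2^n - 2 value anticipates the sensitivity results two slides ahead; the parameter names are illustrative:

```python
# SRRIP insertion sketch: the only change from the LRU-emulation sketch is the
# RRPV given to a newly filled block (the deck's example shows RRPV = 2 for n = 2).

N_BITS = 2
LONG_INTERVAL = (1 << N_BITS) - 2      # 2^n - 2: predict a long re-reference interval

def srrip_on_fill(rrpv, way):
    # New blocks are not assumed to be re-referenced soon; they must earn a
    # shorter prediction by actually receiving a hit.
    rrpv[way] = LONG_INTERVAL
```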

Physical Way #:  0    1    2    3    4    5    6    7
Cache Tag:       a    c    b    h    f    d    g    e
RRPV:            3    2    3    0    1    1    0    1

Example: way 0 is the victim; new block 's' is inserted there with RRPV = 2.

Page 14:

Static RRIP Update Policy on Cache Hits

• Hit Priority (HP)
  • Like LRU, always update RRPV to '0' on cache hits (see the sketch below)
  • Intuition: predicts that blocks receiving hits after insertion will be re-referenced soon
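A one-function sketch of the Hit Priority update described above, paired with the insertion sketch from the previous slide (illustrative, not the paper's code):

```python
def srrip_hp_on_hit(rrpv, way):
    # Hit Priority: any hit promotes the block to the shortest ('near-immediate')
    # prediction, much as LRU moves a hit block to the MRU position.
    rrpv[way] = 0

# Example: block 's' was inserted with RRPV = 2; a hit drops it to RRPV = 0,
# so it now outlives blocks that never received a hit.
```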

Example: way 0 now holds the newly inserted block 's' with RRPV = 2; a hit on 's' updates its RRPV to '0'.

An Alternative Update Scheme Also Described in Paper

Page 15:

SRRIP Hit Priority Sensitivity to Cache Insertion Prediction at LLC

[Chart: % fewer misses relative to LRU for SRRIP-HP with n = 1, as a function of the RRIP value at insertion]

Averaged across PC games, multimedia, server, and SPEC06 workloads on a 16-way 2MB LLC

n = 1 is in fact the NRU replacement policy commonly used in commercial processors

Page 16:

SRRIP Hit Priority Sensitivity to Cache Insertion Prediction at LLC

[Chart: % fewer misses relative to LRU for SRRIP-HP with n = 1 through n = 5, as a function of the RRIP value at insertion]

Averaged across PC games, multimedia, server, and SPEC06 workloads on a 16-way 2MB LLC

Regardless of 'n', Static RRIP performs best when RRPV_insertion is 2^n - 2
Regardless of 'n', Static RRIP performs worst when RRPV_insertion is 2^n - 1

Page 17:

Why Does an RRPV_insertion of 2^n - 2 Work Best for SRRIP?

• Before a scan, the re-reference prediction of the active working set is '0'

• Recall, NRU (n = 1) is not scan-resistant
  • For scan resistance, RRPV_insertion MUST differ from the RRPV of the working-set blocks

• A larger insertion RRPV tolerates larger scans
  • The maximum insertion prediction (i.e., 2^n - 2) works best!

• In general, re-references after the scan hit IF:

  S_len < (RRPV_insertion - starting RRPV_workingset) * (LLC_size - W_size)

SRRIP is Scan Resistant for S_len < (RRPV_insertion) * (LLC_size - W_size)  (a back-of-envelope example follows below)
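A quick numeric check of the bound above, with toy sizes chosen only for illustration (the block counts are assumptions, not numbers from the slide):

```python
# Scan-resistance bound, assuming the working set's RRPV is 0 before the scan.
n = 2
rrpv_insertion = (1 << n) - 2           # 2
llc_blocks, ws_blocks = 512, 384        # assumed cache and working-set sizes, in blocks
max_tolerated_scan = rrpv_insertion * (llc_blocks - ws_blocks)
print(max_tolerated_scan)               # 256 blocks: longer scans start evicting the working set
```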

[Diagram: a scan of S_len blocks interleaved with hits to a working set of W_size blocks; whether the working set still hits after the scan depends on the bound above]

For n > 1 Static RRIP is Scan Resistant! What about Thrash Resistance?

Page 18:

DRRIP: Extending Scan-Resistant SRRIP to Be Thrash-Resistant


• Always using the same prediction for all insertions thrashes the cache

• Like DIP, need to preserve some fraction of the working set in the cache
  • Extend DIP to SRRIP to provide thrash resistance

• Dynamic Re-Reference Interval Prediction:
  • Dynamically select between inserting blocks with RRPV 2^n - 1 and 2^n - 2 using Set Dueling (a schematic follows below)
  • Inserting blocks with 2^n - 1 is the same as "no update on insertion"
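A schematic of the Set Dueling selection described above, in the style of DIP. The leader-set mapping, counter width, and the probability with which the "distant" policy occasionally inserts at 2^n - 2 are illustrative assumptions, not values given in the deck:

```python
import random

PSEL_MAX = 1023                 # assumed 10-bit saturating policy-selection counter
psel = PSEL_MAX // 2
DISTANT, LONG = 3, 2            # 2^n - 1 and 2^n - 2 for n = 2

def insertion_rrpv(set_index, num_sets=4096, leaders=32):
    """Pick the insertion RRPV for a miss in the given set.
    Leader sets always use one policy and train PSEL with their misses;
    follower sets use whichever policy PSEL currently favours."""
    global psel
    stride = num_sets // leaders
    if set_index % stride == 0:                       # "always 2^n - 2" leader set
        psel = min(psel + 1, PSEL_MAX)                # a miss here is a vote against it
        return LONG
    if set_index % stride == 1:                       # "mostly 2^n - 1" (distant) leader set
        psel = max(psel - 1, 0)                       # a miss here is a vote against it
        return DISTANT if random.random() > 1 / 32 else LONG
    if psel < PSEL_MAX // 2:                          # followers: 2^n - 2 currently winning
        return LONG
    return DISTANT if random.random() > 1 / 32 else LONG
```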

[Diagram: on a thrashing access pattern, SRRIP misses on every reference, while DRRIP preserves part of the working set and hits on it]

DRRIP Provides Both Scan-Resistance and Thrash-Resistance

Page 19:

Performance Comparison of Replacement Policies

Static RRIP Always Outperforms LRU Replacement
Dynamic RRIP Further Improves Performance of Static RRIP

16-way 2MB LLC

Page 20:

Cache Replacement Competition (CRC) Results

[Charts: CRC results for a 16-way 1MB private cache (65 single-threaded workloads) and a 16-way 4MB shared cache (165 4-core workloads), with DRRIP highlighted in both]

Averaged across PC games, multimedia, enterprise server, and SPEC CPU2006 workloads

Un-tuned DRRIP Would Be Ranked 2nd and Is Within 1% of the CRC Winner
Unlike the CRC Winner, DRRIP Does Not Require Any Changes to the Cache Structure

Page 21:

Total Storage Overhead (16-way Set Associative Cache)

• LRU: 4 bits / cache block

• NRU: 1 bit / cache block

• DRRIP-3: 3 bits / cache block

• CRC Winner: ~8 bits / cache block
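The per-cache totals behind these bullets, assuming 64-byte blocks and a 2MB cache (the block size is an assumption; the slide only states 16-way set associativity):

```python
blocks = (2 * 1024 * 1024) // 64                      # 32768 blocks in a 2MB cache
for policy, bits in [("LRU", 4), ("NRU", 1), ("DRRIP-3", 3), ("CRC winner", 8)]:
    print(f"{policy:10s}: {blocks * bits / 8 / 1024:4.0f} KB of replacement state")
# LRU: 16 KB, NRU: 4 KB, DRRIP-3: 12 KB, CRC winner: ~32 KB
```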

DRRIP Outperforms LRU With Less Storage Than LRU

NRU Can Be Easily Extended to Realize DRRIP!

Page 22:

Summary

• Scan-resistance is an important problem in commercial workloads
  • State-of-the-art policies do not address scan-resistance

• We propose a simple and practical replacement policy
  • Static RRIP (SRRIP) for scan-resistance
  • Dynamic RRIP (DRRIP) for thrash-resistance and scan-resistance

• DRRIP requires ONLY 3 bits per block
  • In fact, it incurs less storage than LRU

• Un-tuned DRRIP would take 2nd place in the CRC Championship
  • DRRIP requires significantly less storage than the CRC winner

Page 23:

Q&A


Page 26:

Static RRIP with n=1

• Static RRIP with n = 1 is the commonly used NRU policy (with inverted bit polarity); a sketch follows below
  • Victim Selection Policy: evict a block with RRPV = '1'
  • Insertion Policy: insert the new block with RRPV = '0'
  • Update Policy: a cache hit updates the block's RRPV to '0'
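The earlier 2-bit sketch collapses to NRU when n = 1; a minimal illustrative version (bit polarity follows the RRPV convention used in this deck, i.e., 1 means 'distant'):

```python
DISTANT = 1                         # 2^1 - 1

def nru_victim(rrpv):
    # Evict a block whose bit is set; if none is set, set every bit and retry.
    while True:
        for way, v in enumerate(rrpv):
            if v == DISTANT:
                return way
        for way in range(len(rrpv)):
            rrpv[way] = DISTANT

def nru_on_fill_or_hit(rrpv, way):
    rrpv[way] = 0                   # both fills and hits predict "re-referenced soon"
```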

Physical Way #:  0    1    2    3    4    5    6    7
Cache Tag:       a    c    b    h    f    d    g    e
RRPV (n = 1):    1    1    1    0    1    1    0    1

Example: way 0 is the victim; new block 's' is inserted with RRPV = 0, and a later hit keeps its RRPV at '0'.

But NRU Is Not Scan-Resistant

Page 27:

SRRIP Update Policy on Cache Hits

• Frequency Priority (FP):
  • On a hit, improve the block's re-reference prediction to be shorter than before (i.e., RRPV--); a sketch follows below
  • Intuition: like LFU, predicts that frequently referenced blocks should have higher priority to stay in the cache
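A one-function sketch of the Frequency Priority update described above (illustrative):

```python
def srrip_fp_on_hit(rrpv, way):
    # Frequency Priority: each hit shortens the prediction by one step (RRPV--),
    # saturating at 0, so blocks reach 'near-immediate' only after repeated hits.
    rrpv[way] = max(rrpv[way] - 1, 0)
```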


Page 28:

SRRIP-HP and SRRIP-FP Cache Performance

[Charts: % fewer cache misses relative to LRU for SRRIP-Frequency Priority and SRRIP-Hit Priority with n = 1 through n = 5, across GAMES, MULTIMEDIA, SERVER, SPEC06, and ALL]

• SRRIP-HP has 2X better cache performance relative to LRU than SRRIP-FP

• We do not need to precisely detect frequently referenced blocks

• We need to preserve blocks that receive hits

Page 29:

Common Access Patterns in Workloads

• Stack Access Pattern: (a1, a2, ..., ak, ..., a2, a1)^A
  • For any 'k', LRU performs well for such access patterns

• Streaming Access Pattern: (a1, a2, ..., ak) for k >> assoc
  • No solution: cache replacement cannot solve this problem

• Thrashing Access Pattern: (a1, a2, ..., ak)^A, for k > assoc
  • LRU receives no cache hits due to cache thrashing (a toy LRU simulation follows this list)
  • Solution: preserve some fraction of the working set in the cache (e.g., use BIP)
    • BIP does NOT update replacement state for the majority of cache insertions

• Mixed Access Pattern: [(a1, a2, ..., ak, ..., a2, a1)^A (b1, b2, ..., bm)]^N, m > assoc - k
  • (b1, b2, ..., bm) is commonly referred to as a scan in the literature
  • LRU always misses on the frequently referenced (a1, a2, ..., ak, ..., a2, a1)^A
  • In the absence of the scan, LRU performs well for such access patterns
  • Solution: preserve the frequently referenced working set in the cache (e.g., use LFU)
    • LFU replaces infrequently referenced blocks in the presence of frequently referenced blocks

(Games, multimedia, and enterprise server workloads exhibit such mixed access patterns.)
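The toy simulation below illustrates the thrashing bullet above: with more distinct blocks than ways, cyclic re-use gets zero hits under LRU. The sizes and trace are arbitrary, for illustration only:

```python
from collections import OrderedDict

def lru_hits(trace, ways):
    """Count hits for a single fully-associative set of 'ways' blocks under LRU."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)           # refresh to MRU
        else:
            if len(cache) == ways:
                cache.popitem(last=False)      # evict the LRU block
            cache[block] = True
    return hits

trace = [a for _ in range(4) for a in range(10)]   # (a1 .. a10)^4 against an 8-way set
print(lru_hits(trace, ways=8))                     # 0: the set thrashes, exactly as described
```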


Page 30:

[Chart: % fewer misses relative to LRU for DIP and HYB(LRU, LFU) on individual PC games, multimedia, server, and SPEC CPU2006 workloads (halflife2, halo, gunmetal2, final-fantasy, photoshop, renderman, sap, tpc-c, app-server, cactusADM, sphinx3, hmmer, mcf, bzip2) and on the GAMES, MULTIMEDIA, SERVER, SPEC06, and ALL averages]

Performance of Hybrid Replacement Policies at LLC

• DIP addresses SPEC workloads but NOT PC games & multimedia workloads

• Real-world workloads prefer scan-resistance over thrash-resistance


4-way OoO Processor, 32KB L1, 256KB L2, 2MB LLC

Page 31:

Understanding LRU Enhancements in the Prediction Framework

• Recent policies, e.g., DIP, say "insert new blocks at the 'LRU position'"
  • What does it mean to insert an MRU line at the LRU position?
  • It is a prediction that the new block will be re-referenced later than the existing blocks in the cache
  • What DIP really means is "insert new blocks at the 'RRP tail'"

• Other policies, e.g., PIPP, say "insert new blocks in the 'middle of the LRU chain'"
  • A prediction that the new block will be re-referenced at an intermediate time

The Re-Reference Prediction Framework Helps Describe the Intuitions Behind Existing Replacement Policy Enhancements

Page 32:

Performance Comparison of Replacement Policies

Static RRIP Always Outperforms LRU Replacement
Dynamic RRIP Further Improves Performance of Static RRIP

16-way 2MB LLC