09 Cache Perf

Transcript of 09 Cache Perf

  • Slide 1: Improving Cache Performance

    CPE 232 Computer Organization, Spring 2006

    Dr. Gheith Abandah

    [Adapted from the slides of Professor Mary Irwin (www.cse.psu.edu/~mji), which were in turn adapted from Computer Organization and Design, Patterson & Hennessy, 2005, UCB]

  • Slide 2: Review: The Memory Hierarchy

    [Figure: the memory hierarchy, from the Processor down through L1$, L2$, Main Memory, and Secondary Memory, with access time and (relative) memory size increasing with distance from the processor. Transfer units between levels: 4-8 bytes (word) between the processor and L1$, 8-32 bytes (block) between L1$ and L2$, 1 to 4 blocks between L2$ and Main Memory, and 1,024+ bytes (disk sector = page) between Main Memory and Secondary Memory.]

    Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.

    Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

  • Slide 3: Review: Principle of Locality

    Temporal Locality: keep most recently accessed data items closer to the processor.

    Spatial Locality: move blocks consisting of contiguous words to the upper levels.


  • Slide 4: Measuring Cache Performance

    Assuming cache hit costs are included as part of the normal CPU execution cycle, then

    CPU time = IC × CPI × CC
             = IC × (CPI_ideal + Memory-stall cycles) × CC

    where the quantity in parentheses is CPI_stall.

    Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):

    Read-stall cycles = reads/program × read miss rate × read miss penalty

    Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls

    For write-through caches, we can simplify this to

    Memory-stall cycles = accesses/program × miss rate × miss penalty
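
    These formulas translate directly into code; here is a minimal sketch (the function and parameter names are illustrative, not from the slides):

    def memory_stall_cycles(accesses_per_instr, miss_rate, miss_penalty):
        """Simplified write-through form: accesses × miss rate × miss penalty."""
        return accesses_per_instr * miss_rate * miss_penalty

    def cpi_stall(cpi_ideal, memory_stalls_per_instr):
        """CPI_stall = CPI_ideal + memory-stall cycles per instruction."""
        return cpi_ideal + memory_stalls_per_instr

    def cpu_time(instruction_count, cpi, clock_cycle_seconds):
        """CPU time = IC × CPI × CC."""
        return instruction_count * cpi * clock_cycle_seconds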

  • Slide 5: Review: The Memory Wall

    Logic vs DRAM speed gap continues to grow

    [Figure: log-scale plot (0.01 to 1000) of clocks per instruction for the core versus clocks per DRAM access for memory, from VAX/1980 through PPro/1996 projected to 2010+, showing the widening gap.]

  • Slide 6: Impacts of Cache Performance

    The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI).

    Memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss.

    The lower the CPI_ideal, the more pronounced the impact of stalls.

    Consider a processor with a CPI_ideal of 2, a 100 cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates:

    Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44

    So CPI_stall = 2 + 3.44 = 5.44

    What if the CPI_ideal is reduced to 1? 0.5? 0.25?

    What if the processor clock rate is doubled (doubling the miss penalty)?
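
    A short script answers these questions (a sketch: the rates and penalties are the slide's, the code itself is illustrative):

    # Per-instruction memory stalls: every instruction fetches from the I$
    # (2% miss rate) and 36% of instructions access the D$ (4% miss rate).
    def stalls(miss_penalty, i_miss=0.02, d_miss=0.04, ls_frac=0.36):
        return i_miss * miss_penalty + ls_frac * d_miss * miss_penalty

    for cpi_ideal in (2, 1, 0.5, 0.25):
        print(cpi_ideal, cpi_ideal + stalls(100))   # ≈ 5.44, 4.44, 3.94, 3.69

    # Doubling the clock rate doubles the miss penalty measured in cycles:
    print(2 + stalls(200))                          # ≈ 2 + 6.88 = 8.88

    Note how the stalls come to dominate as CPI_ideal falls: at a CPI_ideal of 0.25, over 90% of the cycles per instruction are memory stalls.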

  • Slide 7: Reducing Cache Miss Rates #1

    1. Allow more flexible block placement

    In a direct mapped cache a memory block maps to exactly one cache block.

    At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache.

    A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):

    (block address) modulo (# sets in the cache)
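
    As a minimal sketch of this placement rule (one-word, 4-byte blocks assumed; the names are illustrative):

    # Which set can the block holding byte address `addr` go in, and what
    # tag identifies it there?
    def placement(addr, num_sets, block_bytes=4):
        block_address = addr // block_bytes
        index = block_address % num_sets    # (block address) modulo (# sets)
        tag = block_address // num_sets     # the remaining high-order bits
        return index, tag

    # Direct mapped: num_sets = # of blocks (n = 1 way per set).
    # Fully associative: num_sets = 1 (the tag is the whole block address).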

  • Slide 8: Set Associative Cache Example

    [Figure: a 2-way set associative cache with two sets (Set 0 and Set 1), each with two ways (Way 0 and Way 1) holding V, Tag, and Data fields, alongside a main memory of one word blocks at addresses 0000xx through 1111xx. The two low order bits define the byte in the word (32-bit words).]

    Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache.

    Q2: How do we find it? Use the next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache).

  • Slide 9: Another Reference String Mapping

    Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid). In a 2-way set associative cache the accesses go miss, miss, hit, hit, ...: after the two compulsory misses, Mem(0) (tag 000) and Mem(4) (tag 010) occupy the two ways of the same set, and every later access hits.

    This solves the ping pong effect that conflict misses cause in a direct mapped cache, since now two memory locations that map into the same cache set can co-exist!

    8 requests, 2 misses
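
    A short simulation reproduces the contrast with the direct mapped case (illustrative code; both configurations hold two one-word blocks in total):

    # Count misses for a word reference string in a small cache with
    # `num_sets` sets of `ways` blocks each, using LRU replacement.
    def misses(refs, num_sets, ways):
        sets = [[] for _ in range(num_sets)]    # per set: tags, LRU first
        count = 0
        for block in refs:
            s, tag = sets[block % num_sets], block // num_sets
            if tag in s:
                s.remove(tag)                   # hit: refresh recency
            else:
                count += 1
                if len(s) == ways:
                    s.pop(0)                    # evict the LRU block
            s.append(tag)
        return count

    refs = [0, 4, 0, 4, 0, 4, 0, 4]
    print(misses(refs, num_sets=2, ways=1))     # direct mapped: 8 (ping pong)
    print(misses(refs, num_sets=1, ways=2))     # 2-way set associative: 2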

  • Slide 10: Four-Way Set Associative Cache

    2^8 = 256 sets, each with four ways (each with one block)

    [Figure: a four-way set associative cache. The 32-bit byte address splits into a 22-bit tag, an 8-bit index, and the byte offset. The index selects one of the 256 sets; the V, Tag, and Data entries of all four ways are read in parallel, the four tags are compared against the address tag, and on a hit a 4x1 select multiplexor steers the 32-bit data word from the matching way to the output.]

  • Slide 11: Range of Set Associative Caches

    For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.

    [Figure: address layout Tag | Index | Block offset | Byte offset. The tag is used for the compare, the index selects the set, and the block offset selects the word in the block. Increasing associativity ends at fully associative (only one set), where the tag is all the bits except the block and byte offsets; decreasing associativity ends at direct mapped (only one way), which has the smallest tags.]
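
    The bit accounting can be checked with a small sketch (the names and the 32-bit address are assumptions for illustration):

    import math

    # Field widths for a fixed-size cache as associativity varies: doubling
    # the ways halves the sets, moving one bit from the index to the tag.
    def field_widths(total_blocks, ways, block_bytes=4, addr_bits=32):
        num_sets = total_blocks // ways
        index_bits = int(math.log2(num_sets))
        offset_bits = int(math.log2(block_bytes))   # block + byte offsets
        tag_bits = addr_bits - index_bits - offset_bits
        return tag_bits, index_bits, offset_bits

    for ways in (1, 2, 4, 8):
        print(ways, field_widths(1024, ways))   # the tag grows as the index shrinks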

  • Slide 12: Costs of Set Associative Caches

    When a miss occurs, which way's block do we pick for replacement?

    Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time.

    - Must have hardware to keep track of when each way's block was used relative to the other blocks in the set

    - For 2-way set associative, this takes one bit per set: set the bit when a block is referenced (and reset the other way's bit), as sketched below

    N-way set associative cache costs:

    N comparators (delay and area)

    MUX delay (set selection) before data is available

    Data is available after set selection (and the Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision.

    - So it is not possible to just assume a hit, continue, and recover later if it was a miss
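
    A sketch of the one-bit-per-set LRU scheme for a 2-way cache (the class and method names are made up for illustration):

    # For 2-way set associativity, one LRU bit per set suffices: it names
    # the way to evict next (the one least recently referenced).
    class TwoWayLRU:
        def __init__(self, num_sets):
            self.victim_way = [0] * num_sets

        def touch(self, set_index, way):
            # Referencing one way makes the *other* way the next victim.
            self.victim_way[set_index] = 1 - way

        def victim(self, set_index):
            return self.victim_way[set_index]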

  • Slide 13: Benefits of Set Associative Caches

    The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

    [Figure: miss rate (0 to 12%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture, 2003.]

    The largest gains are in going from direct mapped to 2-way (a 20%+ reduction in miss rate).

  • Slide 14: Reducing Cache Miss Rates #2

    2. Use multiple levels of caches

    With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of cache, normally a unified L2 cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache.

    For our example (CPI_ideal of 2, 100 cycle miss penalty to main memory, 36% load/stores, and 2% (4%) L1 I$ (D$) miss rates), add a UL2$ that has a 25 cycle miss penalty and a 0.5% miss rate:

    CPI_stall = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54

    (as compared to 5.44 with no L2$)
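
    The same two-level arithmetic as a sketch (the parameter names are illustrative; the rates and penalties are the slide's):

    # L1 misses pay the 25-cycle UL2$ penalty; the 0.5% of all accesses
    # that also miss in UL2$ pay the 100-cycle main memory penalty on top.
    def cpi_two_level(cpi_ideal=2, ls_frac=0.36, l1i_miss=0.02, l1d_miss=0.04,
                      l2_penalty=25, l2_global_miss=0.005, mem_penalty=100):
        l1_stalls = (l1i_miss + ls_frac * l1d_miss) * l2_penalty
        l2_stalls = (1 + ls_frac) * l2_global_miss * mem_penalty
        return cpi_ideal + l1_stalls + l2_stalls

    print(round(cpi_two_level(), 2))    # 3.54, versus 5.44 with no L2$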

  • Slide 15: Multilevel Cache Design Considerations

    Design considerations for L1 and L2 caches are very different.

    The primary cache should focus on minimizing hit time in support of a shorter clock cycle

    - Smaller, with smaller block sizes

    Secondary cache(s) should focus on reducing the miss rate to reduce the penalty of long main memory access times

    - Larger, with larger block sizes

    The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so L1 can be smaller (i.e., faster) but have a higher miss rate.

    For the L2 cache, hit time is less important than miss rate:

    The L2$ hit time determines the L1$'s miss penalty

    The L2$ local miss rate is much greater (>>) than its global miss rate, since the L2$ sees only the accesses that have already missed in the L1$
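
    The local/global distinction in one line of arithmetic (a sketch reusing the miss rates from the example above):

    l1_miss_rate = 0.02        # 2% of all accesses miss in L1
    l2_global = 0.005          # 0.5% of all accesses also miss in L2
    l2_local = l2_global / l1_miss_rate
    print(l2_local)            # 0.25: a 25% local rate >> the 0.5% global rate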

  • Slide 16: Key Cache Design Parameters

    Parameter                  L1 typical    L2 typical
    Total size (blocks)        250 to 2,000  4,000 to 250,000
    Total size (KB)            16 to 64      500 to 8,000
    Block size (B)             32 to 64      32 to 128
    Miss penalty (clocks)      10 to 25      100 to 1,000
    Miss rate (global for L2)  2% to 5%      0.1% to 2%

  • Slide 17: Two Machines' Cache Parameters

    Parameter         Intel P4                                AMD Opteron
    L1 organization   Split I$ and D$                         Split I$ and D$
    L1 cache size     8KB for D$, 96KB for trace cache (~I$)  64KB each for I$ and D$
    L1 block size     64 bytes                                64 bytes
    L1 associativity  4-way set assoc.                        2-way set assoc.
    L1 replacement    ~LRU                                    LRU
    L1 write policy   write-through                           write-back
    L2 organization   Unified                                 Unified
    L2 cache size     512KB                                   1024KB (1MB)
    L2 block size     128 bytes                               64 bytes
    L2 associativity  8-way set assoc.                        16-way set assoc.
    L2 replacement    ~LRU                                    ~LRU
    L2 write policy   write-back                              write-back

  • Slide 18: 4 Questions for the Memory Hierarchy

    Q1: Where can a block be placed in the upper level? (Block placement)

    Q2: How is a block found if it is in the upper level? (Block identification)

    Q3: Which block should be replaced on a miss? (Block replacement)

    Q4: What happens on a write? (Write strategy)

  • Slide 19: Q1&Q2: Where can a block be placed/found?

    Scheme             # of sets                             Blocks per set
    Direct mapped      # of blocks in cache                  1
    Set associative    (# of blocks in cache)/associativity  Associativity (typically 2 to 16)
    Fully associative  1                                     # of blocks in cache

    Scheme             Location method                         # of comparisons
    Direct mapped      Index                                   1
    Set associative    Index the set; compare the set's tags   Degree of associativity
    Fully associative  Compare all blocks' tags                # of blocks

  • Slide 20: Q3: Which block should be replaced on a miss?

    Easy for direct mapped: there is only one choice.

    Set associative or fully associative:

    - Random

    - LRU (Least Recently Used)

    For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU.

    LRU is too costly to implement for high levels of associativity (> 4-way) since tracking the usage information is expensive.
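
    The two policies can be compared on any reference stream with a small sketch (illustrative code, not a benchmark; the measured gap depends on the workload):

    import random

    # Miss count for one cache set of `ways` blocks under LRU or random
    # replacement.  `resident` keeps tags in LRU order, least recent first.
    def set_misses(tags, ways, policy="lru"):
        resident, count = [], 0
        for tag in tags:
            if tag in resident:
                resident.remove(tag)        # hit: refresh recency
            else:
                count += 1
                if len(resident) == ways:
                    i = 0 if policy == "lru" else random.randrange(ways)
                    resident.pop(i)
            resident.append(tag)
        return count

    # A skewed stream (hypothetical workload): 4 hot tags, 8 cold ones.
    stream = random.choices(range(12), weights=[8]*4 + [1]*8, k=10_000)
    print(set_misses(stream, 4, "lru"), set_misses(stream, 4, "random"))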

  • Slide 21: Q4: What happens on a write?

    Write-through: the information is written both to the block in the cache and to the block in the next lower level of the memory hierarchy.

    Write-through is always combined with a write buffer so that write waits to lower level memory can be eliminated (as long as the write buffer doesn't fill).

    Write-back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

    Need a dirty bit to keep track of whether the block is clean or dirty.

    Pros and cons of each?

    Write-through: read misses don't result in writes (so they are simpler and cheaper)

    Write-back: repeated writes require only one write to the lower level
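
    A toy model of that traffic difference (illustrative; it ignores the write buffer and assumes no block is evicted mid-stream):

    # Words sent to the lower level for a stream of writes to block addresses.
    def lower_level_writes(write_stream):
        write_through = len(write_stream)    # every write goes down immediately
        write_back = len(set(write_stream))  # one writeback per dirty block,
        return write_through, write_back     #   paid when it is finally evicted

    print(lower_level_writes([7, 7, 7, 7, 3]))   # (5, 2): repeats favor write-back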

  • Slide 22: Improving Cache Performance

    0. Reduce the time to hit in the cache

    smaller cache

    direct mapped cache

    smaller blocks

    for writes

    - no write allocate: no "hit" on the cache, just write to the write buffer

    - write allocate: to avoid two cycles (first check for hit, then write), pipeline writes to the cache via a delayed write buffer

    1. Reduce the miss rate

    bigger cache

    more flexible placement (increase associativity)

    larger blocks (16 to 64 bytes typical)

    victim cache: a small buffer holding the most recently discarded blocks

  • Slide 23: Improving Cache Performance

    2. Reduce the miss penalty

    smaller blocks

    use a write buffer to hold dirty blocks being replaced, so there is no need to wait for the write to complete before reading

    check the write buffer (and/or victim cache) on a read miss: may get lucky

    for large blocks, fetch the critical word first

    use multiple cache levels: an L2 cache is not tied to the CPU clock rate

    faster backing store/improved memory bandwidth

    - wider buses

    - memory interleaving, page mode DRAMs

  • Slide 24: Summary: The Cache Design Space

    Several interacting dimensions

    cache size

    block size

    associativity

    replacement policy

    write-through vs write-back

    write allocation

    The optimal choice is a compromise

    depends on access characteristics

    - workload

    - use (I-cache, D-cache, TLB)

    depends on technology / cost

    Simplicity often wins

    [Figure: the cache design space sketched along the axes Cache Size, Block Size, and Associativity, with a schematic trade-off curve from Good to Bad as a design factor goes from Less to More, for two interacting factors A and B.]