Caches Where is a block placed in a cache? –Three possible answers three different types...

Caches• Where is a block placed in a cache?

– Three possible answers three different types

Anywhere Fully associative

Only intoone block

Direct mapped

Into subsetof blocks

Set associative

How is a block found?• Cache has an address tag for each block

• Tags are checked in parallel for a match

• Also has a valid bit

Processor address: Block Address Block OffsetTag Index

Identifies datain blockIdentifies setIdentifies block

Which block should be replaced on a miss?

• Direct mapped:– Simple (there can only be one!)

• Associative caches:– Choice involved– Three techniques

• Random

• Least-recently used (LRU)– Often only approximated

• FIFO (approximates LRU)

Random vs LRU (16kB cache)

0

1

2

3

4

5

6

2-way 4-way 8-way

LRURandom

Mis

s R

ate

(%)

Random vs LRU (256kB cache)

0

0.2

0.4

0.6

0.8

1

1.2

2-way 4-way 8-way

LRURandom

Mis

s R

ate

(%)

What happens on a write?

• Reads predominate– Instruction fetches, more loads than stores– MIPS instruction mix:

• 10% stores

• 37% loadsWrites: 7% of memory traffic,21% of data traffic

Amdahl’s Law: We can’t ignore them!

Write Strategy

• Must complete checking tags before starting to write– Read can sometimes proceed safely while tags

are checked

• Must modify only part of the block– Reads can read more than is required

Write Strategy

• Two main approaches

Dirty bit

• Write through

Cache MainMemory

CPU

• Write back

Cache MainMemory×

CPU

Advantages

• Write back– Writes occur at cache speeds– Only one memory access after multiple writes

• Lower memory bandwidth

• Write through– Efficient read misses– Simple implementation– Memory and cache are consistent

Good for multiprocessors

Good for multi-processors!

Optimising Write Through

• Reduce write stalls– Write buffer

• Processor continues while write buffer updates memory

Handling Write Misses

• Write allocate– Fetch block into cache on miss– Good with write back

• No-write allocate– Memory is updated without loading block into

cache– Good with write through

Alpha 21264 Data Cache

• Data cache– 64kB– 64-byte blocks– 2-way set associative– Write back

• Write allocate

– Victim Buffer (similar to Write Buffer)• 8 blocks

Alpha Data Cache Hit

Data Cache

• Uses FIFO (one bit per set)

• If victim buffer is full, CPU must stall

• Write miss:– Write allocate– Similar to read miss

Performance

• Hit– 3 cycles

• Three cycle load delay

• Miss– 9ns to transfer data from next level (6 cycles @

667MHz)

Alpha 21264 Instruction Cache

• Instruction cache– Separate from data cache– 64kB

Separate Caches

• Doubles available bandwidth– Prevents fetch unit stalling on data accesses

• Caches can be optimised separately– UltraSPARC

• Data cache: 16kB, direct mapped, 2 × 16-byte sub-blocks

• Instruction cache: 16kB, 2-way set associative, 32-byte blocks

Unified Caches

• Hold both data and instructions

• Miss rates for instructions are much lower than for data (an order of magnitude)

• Unified cache may have slightly better overall miss rate– 16kB data cache: 11.4%– 16kB instruction cache: 0.4%– 32kB unified cache: 3.18%

3.24%

BUT: extra cycle stall for unified cache: average memoryaccess time is slower (4.44 rather than 4.24 cycles)

5.3. Cache Performance

• Miss rate can be misleading– See last example!

• Better measure is average memory access time

= Hit time + Miss rate × Miss penalty

Performance Issues• Cache is very significant factor

– Example: CPU time increased by 4

• Particularly for:– Low CPI machines– Fast clock speeds

• Simplicity of direct mapped cache may give faster clock rate

Miss Penalty and Out of Order Execution

• Processor may be able to do useful work during cache miss

• Makes analysis of cache performance very difficult!

• Can have a significant impact

Improving Cache Performance

• Very important topic– 1600 papers in 6 years! (2nd Edition)– 5000 papers in 13 years! (3rd Edition)

Improving Cache Performance

• Four categories of optimisation:– Reduce miss rate– Reduce miss penalty– Reduce miss rate or miss penalty using

parallelism– Reduce hit time

AMAT = Hit time + Miss rate × Miss penalty

5.4. Reducing Miss Penalty

• Traditionally, focus on miss rate

• But, cost of miss penalties is increasing dramatically

Multi-level Caches

• Two caches– A small, fast one close to the CPU– A big, slower one between the first cache and

memory

L1cache Main

Memory

CPU L2 Cache

Second-level caches

• Complicates analysis

L1L1L1 penalty Miss RateMissTime Hit AMAT

L2L2L2L1 penalty Miss RateMissTime Hit PenaltyMiss

Analysis of two-level caches

• Local miss rate– Number of misses / number of accesses to this

cache– Artificially high for L2 cache

• Global miss rate– Number of misses / number of accesses by CPU

– Miss rateL1 × Miss rateL2 for L2 cache

Design of two-level caches

• Second level cache should be large– Minimises local miss rate– Big blocks are more feasible (reducing miss

rate)

• Multilevel inclusion property– All data in L1 is also in L2– Useful for multiprocessor consistency

• Can be enforced at L2

Early restart & critical word first

• Minimise CPU waiting time

• Early restart– As soon as requested word arrives send it to

CPU

• Critical word first– Request required word from memory first then

fill rest of cache block

Prioritising read misses

• Write-through caches normally make use of a write buffer

• Problem: may lead to RAW hazards

• Solution: stall read miss until write buffer empties– May be as much as 50% increase in read miss

• Better solution: check write buffer for conflict

Prioritising read misses

• Write-back caches– Long read misses due to writing back dirty

block

• Solution:– Write buffer– Handle read miss then write back the dirty

block– Need to do the same conflict checking (or stall

for the write buffer to drain)

Merging Write Buffer

• Write buffers merge data being written to the same area of memory

• Benefits:– More efficient use of buffer– Reduces stalls due to write buffer being full

Victim Caches

• Small ( 5 entries), fully associative cache on the refill path– Holds recently discarded blocks

• Temporal locality

– Experiment (4kB, direct-mapped cache):• 4-entry victim cache

• Removed 20% to 95% of conflict misses

• AMD Athlon: 8 entry victim cache

Caches Where is a block placed in a cache? –Three possible answers three different types...

Documents

Transcript of Caches Where is a block placed in a cache? –Three possible answers three different types...