09 Cache Perf
-
Upload
muh-fauzi-natsir -
Category
Documents
-
view
224 -
download
0
Transcript of 09 Cache Perf
-
8/10/2019 09 Cache Perf
1/24
CPE232 Improving Cache Performance 1
CPE 232Computer OrganizationSpring 2006
Improving Cache Performance
Dr. Gheith Abandah
[Adapted from the slides of Professor Mary Irwin (www.cse.psu.edu/~mji)
which in turn Adapted from Computer Organization and Design,Patterson & Hennessy, 2005, UCB]
http://www.cse.psu.edu/~mjihttp://www.cse.psu.edu/~mji -
8/10/2019 09 Cache Perf
2/24
CPE232 Improving Cache Performance 2
Review: The Memory Hierarchy
Increasingdistancefrom theprocessor inaccesstime
L1$
L2$
Main Memory
Secondary Memory
Processor
(Relative) size of the memory at each level
Inclusivewhat
is in L1$ is asubset of whatis in L2$ is asubset of whatis in MM that isa subset of is
in SM
4-8 bytes (word)
1 to 4 blocks
1,024+ bytes (disk sector = page)
8-32 bytes (block)
Take advantage of the principle of locality to present theuser with as much memory as is available in the cheapest
technology at the speed offered by the fastest technology
-
8/10/2019 09 Cache Perf
3/24
CPE232 Improving Cache Performance 3
Review: Principle of Locality
Temporal Locality
Keep most recentlyaccessed data items closer to the processor
Spatial Locality
Move blocksconsistingof contiguous wordsto the upper levels
Hit Time
-
8/10/2019 09 Cache Perf
4/24
CPE232 Improving Cache Performance 4
Measuring Cache Performance
Assuming cache hit costs are included as part of thenormal CPU execution cycle, then
CPU time = IC CPI CC
= IC (CPIideal+ Memory-stall cycles) CC
CPIstall
Memory-stall cycles come from cache misses (a sum ofread-stalls and write-stalls)
Read-stall cycles = reads/program read miss rateread miss penalty
Write-stall cycles = (writes/program write miss rate
write miss penalty)+ write buffer stalls
For write-through caches, we can simplify this to
Memory-stall cycles = miss rate miss penalty
-
8/10/2019 09 Cache Perf
5/24
CPE232 Improving Cache Performance 5
Review: The Memory Wall
Logic vs DRAM speed gap continues to grow
0.01
0.1
1
10
100
1000
VAX/1980 PPro/1996 2010+
Core
Memory
Clocksperinstr
uction
ClocksperDRAMa
ccess
-
8/10/2019 09 Cache Perf
6/24
CPE232 Improving Cache Performance 6
Impacts of Cache Performance
Relative cache penalty increases as processorperformance improves (faster clock rate and/or lower CPI)
The memory speed is unlikely to improve as fast as processorcycle time. When calculating CPIstall, the cache miss penalty ismeasured inprocessorclock cycles needed to handle a miss
The lower the CPIideal, the more pronounced the impact of stalls
A processor with a CPIidealof 2, a 100 cycle miss penalty,36% load/store instrs, and 2% I$ and 4% D$ miss rates
Memory-stall cycles = 2% 100 + 36% 4% 100 = 3.44
So CPIstalls = 2 + 3.44 = 5.44
What if the CPIidealis reduced to 1? 0.5? 0.25?
What if the processor clock rate is doubled (doubling themiss penalty)?
-
8/10/2019 09 Cache Perf
7/24CPE232 Improving Cache Performance 7
Reducing Cache Miss Rates #1
1. Allow more flexible block placement
In a direct mappedcachea memory block maps toexactly one cache block
At the other extreme, could allow a memory block to be
mapped to any cache blockfully associative cache
A compromise is to divide the cache into setseach ofwhich consists of n ways (n-way set associative). Amemory block maps to a unique set (specified by theindex field) and can be placed in any way of that set (sothere are n choices)
(block address) modulo (# sets in the cache)
-
8/10/2019 09 Cache Perf
8/24CPE232 Improving Cache Performance 8
Set Associative Cache Example
0
Cache
Main Memory
Q2: How do we find it?
Usenext 1 low ordermemory address bittodetermine which
cache set (i.e., modulothe number of sets inthe cache)
Tag Data
Q1: Is it there?
Compare allthe cachetagsin the set to thehigh order 3 memoryaddress bitsto tell ifthe memory block is inthe cache
V
0000xx
0001xx
0010xx0011xx
0100xx
0101xx
0110xx
0111xx1000xx
1001xx
1010xx
1011xx
1100xx
1101xx
1110xx
1111xx
Two low order bits
define the byte in theword (32-b words)One word blocks
Set
1
0
1
Way
0
1
-
8/10/2019 09 Cache Perf
9/24CPE232 Improving Cache Performance 9
Another Reference String Mapping
0 4 0 4
Consider the main memory word reference string
0 4 0 4 0 4 0 4
miss miss hit hit
000 Mem(0) 000 Mem(0)
Start with an empty cache - all
blocks initially marked as not valid
010 Mem(4) 010 Mem(4)
000 Mem(0) 000 Mem(0)
010 Mem(4)
Solves the ping pong effect in a direct mapped cachedue to conflictmisses since now two memory locationsthat map into the same cache set can co-exist!
8 requests, 2 misses
-
8/10/2019 09 Cache Perf
10/24CPE232 Improving Cache Performance 10
Four-Way Set Associative Cache 28= 256 sets each with four ways (each with one block)
31 30 . . . 13 12 11 . . . 2 1 0 Byte offset
DataTagV0
1
2
.
.
.
253
254
255
DataTagV0
1
2
.
.
.
253
254
255
DataTagV0
1
2
.
.
.
253
254
255
Index DataTagV0
1
2
.
.
.
253
254
255
8Index
22Tag
Hit Data
32
4x1 select
-
8/10/2019 09 Cache Perf
11/24CPE232 Improving Cache Performance 11
Range of Set Associative Caches For a fixed size cache, each increase by a factor of two
in associativity doubles the number of blocks per set (i.e.,
the number or ways) and halves the number of setsdecreases the size of the index by 1 bit and increasesthe size of the tag by 1 bit
Block offset Byte offsetIndexTag
Decreasing associativity
Fully associative(only one set)Tag is all the bits exceptblock and byte offset
Direct mapped(only one way)Smaller tags
Increasing associativity
Selects the setUsed for tag compare Selects the word in the block
-
8/10/2019 09 Cache Perf
12/24CPE232 Improving Cache Performance 12
Costs of Set Associative Caches
When a miss occurs, which ways block do we pick forreplacement?
Least Recently Used (LRU): the block replaced is the one thathas been unused for the longest time
- Must have hardware to keep track of when each ways block wasused relative to the other blocks in the set
- For 2-way set associative, takes one bit per setset the bit when ablock is referenced (and reset the other ways bit)
N-way set associative cache costs
N comparators (delay and area)
MUX delay (set selection) before data is available
Data available afterset selection (and Hit/Miss decision). In adirect mapped cache, the cache block is available beforetheHit/Miss decision
- So its not possible to just assume a hit and continue and recover laterif it was a miss
-
8/10/2019 09 Cache Perf
13/24CPE232 Improving Cache Performance 13
Benefits of Set Associative Caches
The choice of direct mapped or set associative dependson the cost of a miss versus the cost of implementation
0
2
4
6
8
10
12
1-way 2-way 4-way 8-way
Associativity
MissR
ate
4KB
8KB
16KB
32KB
64KB128KB
256KB
512KB
Data from Hennessy &
Patterson, ComputerArchitecture, 2003
Largest gains are in going from direct mapped to 2-way
(20%+ reduction in miss rate)
-
8/10/2019 09 Cache Perf
14/24
CPE232 Improving Cache Performance 14
Reducing Cache Miss Rates #2
2. Use multiple levels of caches
With advancing technology have more than enoughroom on the die for bigger L1 caches orfor a secondlevel of cachesnormally a unifiedL2 cache (i.e., itholds both instructions and data) and in some caseseven a unified L3 cache
For our example, CPIidealof 2, 100 cycle miss penalty(to main memory), 36% load/stores, a 2% (4%) L1I$(D$) miss rate, add a UL2$ that has a 25 cycle misspenalty and a 0.5% miss rate
CPIstalls = 2 + .0225 + .36.0425 + .005100 +.36.005100 = 3.54
(as compared to 5.44 with no L2$)
-
8/10/2019 09 Cache Perf
15/24
CPE232 Improving Cache Performance 15
Multilevel Cache Design Considerations
Design considerations for L1 and L2 caches are verydifferent
Primary cache should focus on minimizing hit timein support ofa shorter clock cycle
- Smaller with smaller block sizes
Secondary cache(s) should focus on reducing miss rateto
reduce the penalty of long main memory access times- Larger with larger block sizes
The miss penalty of the L1 cache is significantly reducedby the presence of an L2 cacheso it can be smaller(i.e., faster) but have a higher miss rate
For the L2 cache, hit time is less important than miss rate
The L2$ hit time determines L1$s miss penalty
L2$ local miss rate >> than the global miss rate
-
8/10/2019 09 Cache Perf
16/24
CPE232 Improving Cache Performance 16
Key Cache Design Parameters
L1 typical L2 typical
Total size (blocks) 250 to 2000 4000 to250,000
Total size (KB) 16 to 64 500 to 8000
Block size (B) 32 to 64 32 to 128
Miss penalty (clocks) 10 to 25 100 to 1000
Miss rates(global for L2)
2% to 5% 0.1% to 2%
-
8/10/2019 09 Cache Perf
17/24
CPE232 Improving Cache Performance 17
Two Machines Cache Parameters
Intel P4 AMD Opteron
L1 organization Split I$ and D$ Split I$ and D$
L1 cache size 8KB for D$, 96KB fortrace cache (~I$)
64KB for each of I$ and D$
L1 block size 64 bytes 64 bytes
L1 associativity 4-way set assoc. 2-way set assoc.
L1 replacement ~ LRU LRU
L1 write policy write-through write-back
L2 organization Unified Unified
L2 cache size 512KB 1024KB (1MB)
L2 block size 128 bytes 64 bytes
L2 associativity 8-way set assoc. 16-way set assoc.
L2 replacement ~LRU ~LRU
L2 write policy write-back write-back
-
8/10/2019 09 Cache Perf
18/24
CPE232 Improving Cache Performance 18
4 Questions for the Memory Hierarchy
Q1: Where can a block be placed in the upper level?
(Block placement)
Q2: How is a block found if it is in the upper level?(Block identification)
Q3: Which block should be replaced on a miss?(Block replacement)
Q4: What happens on a write?(Write strategy)
-
8/10/2019 09 Cache Perf
19/24
CPE232 Improving Cache Performance 19
Q1&Q2: Where can a block be placed/found?
# of sets Blocks per setDirect mapped # of blocks in cache 1
Set associative (# of blocks in cache)/associativity
Associativity (typically2 to 16)
Fully associative 1 # of blocks in cache
Location method # of comparisons
Direct mapped Index 1
Set associative Index the set; comparesets tags
Degree ofassociativity
Fully associative Compare all blocks tags # of blocks
-
8/10/2019 09 Cache Perf
20/24
CPE232 Improving Cache Performance 20
Q3: Which block should be replaced on a miss?
Easy for direct mappedonly one choice
Set associative or fully associative Random
LRU (Least Recently Used)
For a 2-way set associative cache, randomreplacement has a miss rate about 1.1 times higherthan LRU.
LRU is too costly to implement for high levels of
associativity (> 4-way) since tracking the usageinformation is costly
-
8/10/2019 09 Cache Perf
21/24
CPE232 Improving Cache Performance 21
Q4: What happens on a write?
Write-throughThe information is written to both theblock in the cache and to the block in the next lower levelof the memory hierarchy
Write-through is always combined with a write buffer so writewaits to lower level memory can be eliminated (as long as thewrite buffer doesnt fill)
Write-backThe information is written only to the block inthe cache. The modified cache block is written to mainmemory only when it is replaced.
Need a dirty bit to keep track of whether the block is clean or dirty
Pros and cons of each? Write-through: read misses dont result in writes (so are simpler
and cheaper)
Write-back: repeated writes require only one write to lower level
-
8/10/2019 09 Cache Perf
22/24
CPE232 Improving Cache Performance 22
Improving Cache Performance
0. Reduce the time to hit in the cache
smaller cache
direct mapped cache
smaller blocks
for writes
- no write allocateno hit on cache, just write to write buffer
- write allocateto avoid two cycles (first check for hit, then write)pipeline writes via a delayed write buffer to cache
1. Reduce the miss rate
bigger cache more flexible placement (increase associativity)
larger blocks (16 to 64 bytes typical)
victim cachesmall buffer holding most recently discarded blocks
-
8/10/2019 09 Cache Perf
23/24
CPE232 Improving Cache Performance 23
Improving Cache Performance
2. Reduce the miss penalty smaller blocks
use a write buffer to hold dirty blocks being replaced so donthave to wait for the write to complete before reading
check write buffer (and/or victim cache) on read missmay getlucky
for large blocks fetch critical word first use multiple cache levelsL2 cache not tied to CPU clock rate
faster backing store/improved memory bandwidth
- wider buses
- memory interleaving, page mode DRAMs
-
8/10/2019 09 Cache Perf
24/24
CPE232 Improving Cache Performance 24
Summary: The Cache Design Space
Several interacting dimensions
cache size
block size
associativity
replacement policy
write-through vs write-back
write allocation
The optimal choice is a compromise
depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
depends on technology / cost
Simplicity often wins
Associativity
Cache Size
Block Size
Bad
Good
Less More
Factor A Factor B