Lecture 4: Memory (University of Arkansas)
Chapter 5
Large and Fast: Exploiting Memory Hierarchy
Processor-Memory Performance Gap
[Figure: performance vs. year, 1980–2004, log scale. Processor performance ("Moore's Law") improves about 55%/year (2X per 1.5 years), DRAM improves about 7%/year (2X per 10 years), so the processor-memory performance gap grows about 50%/year.]
The “Memory Wall”: the processor vs. DRAM speed disparity continues to grow.
[Figure: core vs. memory trend from VAX/1980 to PPro/1996 to 2010+, comparing clocks per instruction with clocks per DRAM access on a log scale (0.01 to 1000).]
Good memory hierarchy (cache) design is increasingly important to overall performance
Memory Technology
Static RAM (SRAM): 0.5ns – 2.5ns, $2000 – $5000 per GB
Dynamic RAM (DRAM): 50ns – 70ns, $20 – $75 per GB
Magnetic disk: 5ms – 20ms, $0.20 – $2 per GB
Ideal memory: access time of SRAM, capacity and cost/GB of disk
Memory Hierarchy: The Pyramid
A Typical Memory Hierarchy
[Figure: on-chip components (Control, Datapath, RegFile, Instr Cache, Data Cache, ITLB, DTLB), then a Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk).]
Speed (cycles): ½'s, 1's, 10's, 100's, 10,000's
Size (bytes): 100's, 10K's, M's, G's, T's
Cost per byte: highest at the register/cache end, lowest at the disk end
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
Inside the Processor
AMD Barcelona: 4 processor cores
Principle of Locality
Programs access a small proportion of their address space at any time.
Temporal locality (locality in time): items accessed recently are likely to be accessed again soon, e.g., instructions in a loop, induction variables. Keep the most recently accessed items in the cache.
Spatial locality (locality in space): items near those accessed recently are likely to be accessed soon, e.g., sequential instruction access, array data. Move blocks consisting of contiguous words closer to the processor.
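To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides): the sequential sweep over the array exhibits spatial locality, while the repeated use of the loop variables and loop instructions exhibits temporal locality.

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    int sum = 0;

    /* Spatial locality: a[i] and a[i+1] sit in the same or the next
       cache block, so a sequential sweep mostly hits after each miss. */
    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Temporal locality: the loop instructions and the variables
       i and sum are reused on every iteration. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}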
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 9
Taking Advantage of Locality
Memory hierarchy: store everything on disk. Copy recently accessed (and nearby) items from disk to a smaller DRAM memory (main memory). Copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory (cache memory attached to the CPU).
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 10
Memory Hierarchy Levels
Block (aka cache line): the unit of copying between levels; may be multiple words.
If accessed data is present in the upper level
Hit: access satisfied by the upper level
Hit ratio: hits / accesses
Hit time: time to access the block in the upper level + time to determine hit/miss
If accessed data is absent
Miss: data not in the upper level
Miss ratio: misses / accesses = 1 – hit ratio
Miss penalty: time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 11
Cache Memory
§5.3 The Basics of Caches
Cache memory: the level of the memory hierarchy closest to the CPU.
Given accesses X1, …, Xn–1, Xn:
How do we know if the data is present?
Where do we look?
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 12
Direct Mapped Cache
Location is determined by the address. Direct mapped: only one choice,
(Block address) modulo (#Blocks in cache)
Since #Blocks is a power of 2, this uses the low-order address bits.
[Figure: an 8-block direct-mapped cache; each memory block maps to cache block (block address) modulo 8, so many memory blocks share the same cache block.]
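As a small illustration (a sketch, assuming word addresses and the 8-block cache used in the example that follows), the mapping and the tag can be computed directly:

#include <stdio.h>
#include <stdint.h>

#define NUM_BLOCKS 8u   /* power of 2, as in the 8-block example below */

int main(void) {
    uint32_t block_addrs[] = {22, 26, 16, 3, 18};      /* word (block) addresses */
    for (unsigned i = 0; i < sizeof block_addrs / sizeof block_addrs[0]; i++) {
        uint32_t addr  = block_addrs[i];
        uint32_t index = addr % NUM_BLOCKS;            /* low-order bits: addr & (NUM_BLOCKS - 1) */
        uint32_t tag   = addr / NUM_BLOCKS;            /* remaining high-order bits */
        printf("block address %2u -> cache block %u, tag %u\n", addr, index, tag);
    }
    return 0;
}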
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 13
Tags and Valid Bits
How do we know which particular block is stored in a cache location? Store the block address as well as the data. Actually, only the high-order bits are needed; they are called the tag.
What if there is no data in a location? A valid bit indicates this: 1 = valid, 0 = not valid. Initially 0.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 14
Cache Example
8 blocks, 1 word/block, direct mapped. Initial state:
Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N
Access sequence (word addresses): 22, 26, 22, 26, 16, 3, 16, 18, 16
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 15
Cache Example
Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 16
Cache Example
Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 17
Cache Example
Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 18
Cache Example
Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 19
Cache Example
Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Hit       000
18         10 010       Miss      010

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 20
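The whole example can be reproduced with a short direct-mapped cache simulator in C (a sketch, assuming word addresses and 1-word blocks as in the example):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS 8   /* 8 blocks, 1 word/block, direct mapped */

struct line { bool valid; uint32_t tag; };

int main(void) {
    static struct line cache[NUM_BLOCKS];               /* valid bits start at 0 */
    uint32_t refs[] = {22, 26, 22, 26, 16, 3, 16, 18, 16};

    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++) {
        uint32_t addr  = refs[i];
        uint32_t index = addr % NUM_BLOCKS;              /* low-order address bits */
        uint32_t tag   = addr / NUM_BLOCKS;              /* high-order address bits */
        bool hit = cache[index].valid && cache[index].tag == tag;
        if (!hit) {                                      /* miss: fill the line, update the tag */
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        printf("word addr %2u -> index %u, tag %u: %s\n",
               addr, index, tag, hit ? "hit" : "miss");
    }
    return 0;
}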
Address Subdivision
Taking Advantage of Spatial Locality
Let the cache block hold more than one word. Consider the word reference string 0, 1, 2, 3, 4, 3, 4, 15 on a cache with two 2-word blocks.
Start with an empty cache - all blocks initially marked as not valid.
0 miss, 1 hit, 2 miss, 3 hit, 4 miss, 3 hit, 4 hit, 15 miss
8 requests, 4 misses - each miss also brings in the neighboring word of the block.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 22
Larger Block Size
64 blocks, 16 bytes/block. To what block number does address 1200 map? 1200 (decimal) = 0x000004B0.
Address fields: Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits), Offset = bits 3–0 (4 bits)
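Working it out: the block address is 1200 / 16 = 75, and 75 modulo 64 = 11, so address 1200 maps to cache block 11 (with tag = 75 / 64 = 1 and byte offset 0).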
Multiword Block Direct Mapped Cache
[Figure: direct-mapped cache with four words/block and a cache size of 1K words (256 blocks). The 32-bit address splits into a 20-bit tag (bits 31–12), an 8-bit index (bits 11–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits 1–0); the index selects one of the 256 entries (0–255), the stored tag is compared against the address tag to produce Hit, and the block offset selects one of the four 32-bit words as Data.]
What kind of locality are we taking advantage of?
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 24
Block Size Considerations
Larger blocks should reduce the miss rate, due to spatial locality; but spatial locality decreases if the block becomes very large.
In a fixed-sized cache, larger blocks mean fewer of them, so more competition and an increased miss rate.
Larger blocks also mean a larger miss penalty, which can override the benefit of the reduced miss rate. Early restart and critical-word-first can help.
Miss Rate vs. Block Size
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 25
Cache Field Sizes
The number of bits in a cache includes the storage for data, the tags, and the flag bits.
For a direct mapped cache with 2^n blocks, n bits are used for the index.
For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word.
What is the size of the tag field?
The total number of bits in a direct-mapped cache is then 2^n × (block size + tag field size + flag bits size).
How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address?
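Working the example: 16KB of data is 4K words, i.e. 2^10 blocks of 4 words, so n = 10 and m = 2. The tag field is 32 − (10 + 2 + 2) = 18 bits and the data field is 4 × 32 = 128 bits per block. With one valid bit per block, the total is 2^10 × (128 + 18 + 1) = 2^10 × 147 = 147 Kbits, about 18.4 KB, or roughly 1.15 times the 16KB of data.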
Handling Cache Hits
Read hits (I$ and D$): this is what we want!
Write hits (D$ only):
require the cache and memory to be consistent: always write the data into both the cache block and the next level in the memory hierarchy (write-through)
or allow the cache and memory to be inconsistent: write the data only into the cache block, and write-back the cache block to the next level in the memory hierarchy when that cache block is “evicted”
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 28
Write-Through
Write through: update the data in both the cache and the main memory.
But this makes writes take longer: writes run at the speed of the next level in the memory hierarchy.
e.g., if base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles: Effective CPI = 1 + 0.1×100 = 11
Solution: a write buffer holds data waiting to be written to memory, so the CPU continues immediately and only stalls on a write if the write buffer becomes full.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 29
Write-Back
Alternative: on a data-write hit, just update the block in the cache, and keep track of whether each block is dirty (requires a dirty bit).
When a dirty block is replaced, write it back to memory; a write buffer can be used.
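A minimal C sketch contrasting the two write-hit policies (the next-level helper is a stub and all names are illustrative, not from the slides):

#include <stdint.h>
#include <stdbool.h>

/* One cache line, trimmed to what the write policies need. */
struct line {
    bool     valid;
    bool     dirty;        /* used only by write-back */
    uint32_t tag;
    uint32_t data;
};

/* Stub standing in for the next level of the memory hierarchy. */
static void write_to_next_level(uint32_t addr, uint32_t value) {
    (void)addr; (void)value;
}

/* Write-through: on a write hit, update the cache block and the next level,
   so the two always agree (a write buffer would hide this latency). */
static void write_hit_through(struct line *l, uint32_t addr, uint32_t value) {
    l->data = value;
    write_to_next_level(addr, value);
}

/* Write-back: on a write hit, update only the cache block and mark it dirty;
   the block is written to the next level later, when it is evicted. */
static void write_hit_back(struct line *l, uint32_t value) {
    l->data = value;
    l->dirty = true;
}

/* Eviction under write-back: only dirty blocks need to be written back. */
static void evict(struct line *l, uint32_t block_addr) {
    if (l->valid && l->dirty)
        write_to_next_level(block_addr, l->data);
    l->valid = false;
    l->dirty = false;
}

int main(void) {
    struct line l = { true, false, 0x2A, 0 };
    write_hit_through(&l, 0x1000, 7);   /* cache and memory both updated */
    write_hit_back(&l, 9);              /* only the cache updated; the line is now dirty */
    evict(&l, 0x1000);                  /* dirty block written back on eviction */
    return 0;
}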
Handling Cache Misses (single-word cache block)
Read misses (I$ and D$): stall the CPU pipeline, fetch the block from the next level of the hierarchy, copy it into the cache, send the requested word to the processor, and resume the CPU pipeline.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 31
Handling Cache Misses (single-word cache block)
What should happen on a write miss?
Write allocate (aka fetch on write): the data at the missed-write location is loaded into the cache.
For write-through there are two alternatives:
Allocate on miss: fetch the block, i.e., write allocate
No write allocate: don't fetch the block, since programs often write a whole data structure before reading it (e.g., initialization)
For write-back: usually fetch the block, i.e., write allocate
Multiword Block Considerations
Read misses (I$ and D$) are processed the same as for single-word blocks – a miss returns the entire block from memory.
The miss penalty grows as the block size grows:
Early restart – the processor resumes execution as soon as the requested word of the block is returned
Requested word first – the requested word is transferred from memory to the cache (and processor) first
A nonblocking cache allows the processor to continue to access the cache while the cache is handling an earlier miss
Multiword Block Considerations
Write misses (D$):
If using write allocate, first fetch the block from memory and then write the word into the block; otherwise we could end up with a “garbled” block in the cache, e.g., for 4-word blocks, a new tag, one word of data from the new block, and three words of data from the old block
If not using write allocate, forward the write request to the main memory
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 34
Main Memory Supporting Caches
Use DRAMs for main memory. Example cache block read: 1 bus cycle for the address transfer, 15 bus cycles per DRAM access, 1 bus cycle per data transfer.
For a 4-word cache block, consider 3 different main memory configurations:
1-word wide memory
4-word wide memory
4-bank interleaved memory
1-Word Wide Bus, 1-Word Wide Memory
What if the block size is four words and the bus and the memory are 1-word wide?
1 cycle to send the address
4 x 15 = 60 cycles to read DRAM (four separate 15-cycle accesses)
4 cycles to return the 4 data words
65 total clock cycles miss penalty
Bandwidth = (4 x 4)/65 = 0.246 B/cycle
4-Word Wide Bus, 4-Word Wide Memory
What if the block size is four words and the bus and the memory are 4-word wide?
1 cycle to send the address
15 cycles to read DRAM (all four words in a single access)
1 cycle to return the 4 data words
17 total clock cycles miss penalty
Bandwidth = (4 x 4)/17 = 0.941 B/cycle
Interleaved Memory, 1-Word Wide Bus
For a block size of four words:
1 cycle to send the address
15 cycles to read the DRAM banks (the four banks are accessed in parallel)
4 x 1 = 4 cycles to return the 4 data words
20 total clock cycles miss penalty
Bandwidth = (4 x 4)/20 = 0.8 B/cycle
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 41
Advanced DRAM Organization
Bits in a DRAM are organized as a rectangular array; a DRAM access reads an entire row.
Burst mode: supply successive words from a row with reduced latency.
Double data rate (DDR) DRAM: transfer on both the rising and falling clock edges.
Quad data rate (QDR) DRAM: separate DDR inputs and outputs.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 42
Measuring Cache Performance
§5.4 Measuring and Improving Cache Performance
Components of CPU time:
Program execution cycles (includes cache hit time)
Memory stall cycles (mainly from cache misses)
With simplifying assumptions:
Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                    = (Instructions / Program) × (Misses / Instruction) × Miss penalty
CPU time = IC × CPI × CC = IC × (CPI_ideal + Average memory-stall cycles per instruction) × CC
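The same formulas in C, with illustrative (assumed) numbers plugged in:

#include <stdio.h>

int main(void) {
    /* Illustrative (assumed) inputs */
    double ic               = 1e9;    /* instruction count (IC) */
    double cpi_ideal        = 1.0;    /* CPI with a perfect cache */
    double cc_seconds       = 0.5e-9; /* clock cycle time (CC) */
    double misses_per_instr = 0.03;   /* misses per instruction */
    double miss_penalty     = 100.0;  /* cycles */

    /* Memory-stall cycles = IC x (misses / instruction) x miss penalty */
    double stall_cycles = ic * misses_per_instr * miss_penalty;

    /* CPU time = IC x (CPI_ideal + average memory-stall cycles per instruction) x CC */
    double cpu_time = ic * (cpi_ideal + misses_per_instr * miss_penalty) * cc_seconds;

    printf("memory-stall cycles = %.0f\n", stall_cycles);
    printf("CPU time            = %.2f s\n", cpu_time);
    return 0;
}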
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 43
Cache Performance Example
Given: I-cache miss rate = 2%, D-cache miss rate = 4%, miss penalty = 100 cycles, base CPI (ideal cache) = 2, loads & stores are 36% of instructions
Miss cycles per instruction:
I-cache: 0.02 × 100 = 2 (every instruction is fetched from the I-cache)
D-cache: 0.36 × 0.04 × 100 = 1.44 (36% of instructions access the D-cache)
Actual CPI = 2 + 2 + 1.44 = 5.44
The CPU with an ideal cache is 5.44/2 = 2.72 times faster
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 44
Average Memory Access Time
Hit time is also important for performance.
Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
Example: CPU with a 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
AMAT = 1 + 0.05 × 20 = 2 ns, i.e., 2 cycles per instruction fetch
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 45
Performance Summary
As CPU performance increases, the miss penalty becomes more significant and a greater proportion of time is spent on memory stalls.
Decreasing the base CPI or increasing the clock rate makes memory stalls account for more CPU cycles.
We can't neglect cache behavior when evaluating system performance.
Mechanisms to Reduce Cache Miss Rates
1. Allow more flexible block placement
2. Use multiple levels of caches
Reducing Cache Miss Rates #1
1. Allow more flexible block placement
In a direct mapped cache, a memory block maps to exactly one cache block.
At the other extreme, a memory block could be allowed to map to any cache block – a fully associative cache.
A compromise is to divide the cache into sets, each of which consists of n “ways” (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices).
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 48
Associative Caches
Fully associative (only one set): allow a given block to go in any cache entry; requires all entries to be searched at once; one comparator per entry (expensive).
n-way set associative: each set contains n entries, and the block address determines the set: (Block address) modulo (#Sets in cache); search all entries in that set at once; n comparators (less expensive).
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 49
Associative Cache Example
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 50
Spectrum of Associativity
For a cache with 8 entries
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 51
Associativity Example
Compare 4-block caches: direct mapped, 2-way set associative, fully associative.
Block address sequence: 0, 8, 0, 6, 8
Direct mapped
Block addr  Cache index  Hit/miss  Cache content after access
0           0            miss      index 0: Mem[0]
8           0            miss      index 0: Mem[8]
0           0            miss      index 0: Mem[0]
6           2            miss      index 0: Mem[0], index 2: Mem[6]
8           0            miss      index 0: Mem[8], index 2: Mem[6]
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 52
Associativity Example
2-way set associative
Block addr  Cache index  Hit/miss  Cache content after access (Set 0)
0           0            miss      Mem[0]
8           0            miss      Mem[0], Mem[8]
0           0            hit       Mem[0], Mem[8]
6           0            miss      Mem[0], Mem[6]
8           0            miss      Mem[8], Mem[6]

Fully associative
Block addr  Hit/miss  Cache content after access
0           miss      Mem[0]
8           miss      Mem[0], Mem[8]
0           hit       Mem[0], Mem[8]
6           miss      Mem[0], Mem[8], Mem[6]
8           hit       Mem[0], Mem[8], Mem[6]
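The 2-way case can be checked with a short C sketch that adds LRU tracking to the earlier direct-mapped simulator (parameters match this example; names are illustrative):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 2   /* 4 blocks, 2-way set associative */
#define WAYS     2

struct way { bool valid; uint32_t tag; unsigned last_used; };

int main(void) {
    static struct way cache[NUM_SETS][WAYS];        /* starts invalid */
    uint32_t refs[] = {0, 8, 0, 6, 8};              /* block addresses from the example */
    unsigned now = 0;

    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++) {
        uint32_t addr = refs[i];
        uint32_t set  = addr % NUM_SETS;            /* (block address) mod (#sets) */
        uint32_t tag  = addr / NUM_SETS;

        bool hit = false;
        int  hit_way = -1;
        for (int w = 0; w < WAYS; w++)              /* search every way of the set */
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                hit = true;
                hit_way = w;
            }

        if (!hit) {                                 /* miss: use an empty way, else the LRU way */
            int victim = 0;
            for (int w = 0; w < WAYS; w++) {
                if (!cache[set][w].valid) { victim = w; break; }
                if (cache[set][w].last_used < cache[set][victim].last_used)
                    victim = w;
            }
            cache[set][victim].valid = true;
            cache[set][victim].tag   = tag;
            hit_way = victim;
        }
        cache[set][hit_way].last_used = ++now;      /* LRU bookkeeping */

        printf("block %u -> set %u: %s\n", addr, set, hit ? "hit" : "miss");
    }
    return 0;
}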
Another Reference String Mapping
Consider the main memory word reference string 0, 4, 0, 4, 0, 4, 0, 4 on a direct mapped cache in which words 0 and 4 map to the same block.
Start with an empty cache - all blocks initially marked as not valid.
miss, miss, miss, miss, miss, miss, miss, miss
8 requests, 8 misses
Ping pong effect due to conflict misses - two memory locations that map into the same cache block keep evicting each other.
Another Reference String Mapping
Now consider the same reference string 0, 4, 0, 4, 0, 4, 0, 4 on a 2-way set associative cache.
Start with an empty cache - all blocks initially marked as not valid.
miss, miss, hit, hit, hit, hit, hit, hit
8 requests, 2 misses
This solves the ping pong effect caused by conflict misses in a direct mapped cache, since now two memory locations that map into the same cache set can co-exist!
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 56
How Much Associativity
Increased associativity decreases the miss rate, but with diminishing returns.
Simulation of a system with a 64KB D-cache, 16-word blocks, SPEC2000: 1-way: 10.3%, 2-way: 8.6%, 4-way: 8.3%, 8-way: 8.1%
Benefits of Set Associative Caches
The largest gains are in going from a direct mapped cache to a 2-way set associative cache (20+% reduction in miss rate).
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 58
Set Associative Cache Organization
Range of Set Associative Caches
For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
Address fields: Tag (used for the tag compare), Index (selects the set), Block offset (selects the word in the block), Byte offset.
Decreasing associativity: direct mapped (only one way) – smaller tags, only a single comparator.
Increasing associativity: fully associative (only one set) – the tag is all the bits except the block and byte offset.
Example: Tag Size
Cache with 4K blocks, a 4-word block size, and a 32-bit byte address.
What is the total number of tag bits for the cache if it is:
direct mapped? 2-way set associative? 4-way set associative? fully associative?
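Working it out: a 4-word block is 16 bytes, so 4 address bits are the block and byte offset, leaving 28 bits for tag plus index. Direct mapped: 4K blocks need 12 index bits, so each tag is 16 bits and the total is 4K × 16 = 64 Kbits. 2-way: 2K sets, 11 index bits, 17-bit tags, 4K × 17 = 68 Kbits. 4-way: 1K sets, 10 index bits, 18-bit tags, 4K × 18 = 72 Kbits. Fully associative: no index, 28-bit tags, 4K × 28 = 112 Kbits.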
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 62
Replacement Policy
Direct mapped: no choice.
Set associative: prefer a non-valid entry, if there is one; otherwise, choose among the entries in the set.
Least-recently used (LRU): choose the entry unused for the longest time. Simple for 2-way, manageable for 4-way, too hard beyond that.
Random: gives approximately the same performance as LRU for high associativity.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 63
Reducing Cache Miss Rates #2
2. Use multiple levels of caches
Primary (L1) cache: attached to the CPU; small, but fast; separate L1 I$ and L1 D$.
Level-2 cache: services misses from the L1 cache; larger and slower than L1, but still faster than main memory; a unified cache for both instructions and data.
Main memory services L-2 cache misses; some high-end systems include an L-3 cache.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 64
Multilevel Cache Example
Given: CPU base CPI = 1, clock rate = 4GHz, miss rate = 2% per instruction, main memory access time = 100ns.
With just the primary cache: miss penalty = 100ns / 0.25ns = 400 cycles, so Effective CPI = 1 + 0.02 × 400 = 9.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 65
Example (cont.)
Now add an L-2 cache with access time = 5ns and a global miss rate to main memory of 0.5%.
Primary miss with L-2 hit: penalty = 5ns / 0.25ns = 20 cycles
Primary miss with L-2 miss: extra penalty = 400 cycles
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9 / 3.4 = 2.6
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 66
Multilevel Cache Considerations
Primary cache: focus on minimal hit time; smaller total size with a smaller block size.
L-2 cache: focus on a low miss rate to avoid main memory accesses; hit time has less overall impact; larger total size with a larger block size and higher levels of associativity.
Global vs. Local Miss Rate
Global miss rate (GR): the fraction of references that miss in all levels of a multilevel cache; it dictates how often main memory is accessed.
Global miss rate so far (GRS): the fraction of references that miss up to a certain level in a multilevel cache.
Local miss rate (LR): the fraction of references to one level of a cache that miss at that level.
The L2$ local miss rate >> the global miss rate.
Example
LR1 = 5%, LR2 = 20%, LR3 = 50%
GR = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
GRS1 = LR1 = 0.05
GRS2 = LR1 × LR2 = 0.05 × 0.2 = 0.01
GRS3 = LR1 × LR2 × LR3 = 0.05 × 0.2 × 0.5 = 0.005
CPI = 1 + GRS1 × Pty1 + GRS2 × Pty2 + GRS3 × Pty3 = 1 + 0.05 × 10 + 0.01 × 20 + 0.005 × 100 = 2.2
Sources of Cache Misses
Compulsory (cold start or process migration, first reference): the first access to a block; a “cold” fact of life, not a whole lot you can do about it. If you are going to run “millions” of instructions, compulsory misses are insignificant. Solution: increase the block size (increases the miss penalty; very large blocks could increase the miss rate).
Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size (may increase access time).
Conflict (collision): multiple memory locations map to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity (may increase access time).
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 70
Cache Design Trade-offs
Design change          Effect on miss rate         Negative performance effect
Increase cache size    Decreases capacity misses   May increase access time
Increase associativity Decreases conflict misses   May increase access time
Increase block size    Decreases compulsory misses Increases miss penalty; for a very large block size, may increase the miss rate due to pollution
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 71
Virtual Memory
§5.7 Virtual Memory
A memory management technique developed for multitasking computer architectures.
Virtualize various forms of data storage: allow a program to be designed as if there is only one type of memory, i.e., “virtual” memory, with each program running in its own virtual address space.
Use main memory as a “cache” for secondary (disk) storage.
Allows efficient and safe sharing of memory among multiple programs.
Provides the ability to easily run programs larger than the size of physical memory.
Simplifies loading a program for execution by providing for code relocation.
Two Programs Sharing Physical Memory
[Figure: Program 1's and Program 2's virtual address spaces both map pages into main memory, which acts as a “cache” of the hard drive.]
A program's address space is divided into pages (fixed size). The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table.
Address Translation
[Figure: a Virtual Address (VA) is split into a virtual page number (bits 31 ... 12) and a page offset (bits 11 ... 0); translation maps the virtual page number to a physical page number, producing a Physical Address (PA) with the physical page number in bits 29 ... 12 and the same page offset.]
So each memory request first requires an address translation from the virtual space to the physical space. A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault.
A virtual address is translated to a physical address by a combination of hardware and software.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 74
Page Fault Penalty
On a page fault, the page must be fetched from disk; this takes millions of clock cycles and is handled by OS code.
Try to minimize the page fault rate:
Fully associative placement - a virtual page can be placed in any physical page frame in main memory
Smart replacement algorithms
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 75
Page Tables
The page table stores placement information: an array of page table entries, indexed by virtual page number.
The page table register in the CPU points to the page table in physical memory.
If the page is present in memory, the page table entry stores the physical page number, plus other status bits (referenced, dirty, ...).
If the page is not present, the page table entry can refer to a location in swap space on disk. Swap space: the space on the disk reserved for the full virtual memory space of a process.
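A rough C sketch of translation through a page table (page size, table size, and names are illustrative assumptions, not from the slides):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12                     /* 4 KB pages (assumed) */
#define NUM_VPAGES 1024                   /* toy virtual address space */

/* One page table entry: a valid bit plus the physical page number.
   A real PTE also carries referenced/dirty/protection bits. */
struct pte { bool valid; uint32_t ppn; };

static struct pte page_table[NUM_VPAGES];

/* Translate a virtual address; returns false on a page fault
   (the page is not present in physical memory). */
static bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_SHIFT;               /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);  /* page offset is unchanged */
    if (vpn >= NUM_VPAGES || !page_table[vpn].valid)
        return false;                                 /* page fault: the OS must fetch the page */
    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return true;
}

int main(void) {
    page_table[3] = (struct pte){ true, 7 };          /* pretend VPN 3 is resident in PPN 7 */

    uint32_t pa, va = (3u << PAGE_SHIFT) | 0x2A4;
    if (translate(va, &pa))
        printf("VA 0x%08x -> PA 0x%08x\n", va, pa);
    else
        printf("page fault\n");
    return 0;
}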
Address Translation Mechanisms
[Figure: the page table register points to the page table in main memory; the virtual page number indexes the page table, whose valid entries hold the physical page base address (invalid entries refer to disk storage); the physical page number is combined with the page offset to form the physical address.]
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 77
Translation Using a Page Table
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 78
Replacement and Writes
To reduce the page fault rate, prefer least-recently used (LRU) replacement.
A reference bit (aka use bit) in the PTE is set to 1 on an access to the page and is periodically cleared to 0 by the OS; a page with reference bit = 0 has not been used recently.
Disk writes take millions of cycles, so a whole page is written back at once, not individual locations. Use write-back (write-through is impractical); a dirty bit in the PTE is set when the page is written.
Virtual Addressing with a Cache
Thus it takes an extra memory access to translate a VA to a PA.
[Figure: the CPU issues a VA, translation produces a PA, and the PA is looked up in the cache; on a miss the request goes to main memory, on a hit the data returns to the CPU.]
This makes memory (cache) accesses very expensive (if every access were really two accesses).
The hardware fix is to use a Translation Lookaside Buffer (TLB) – a small cache that keeps track of recently used address mappings to avoid having to do a page table lookup.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 80
Fast Translation Using a TLB
Address translation would appear to require extra memory references: one to access the PTE, then the actual memory access.
Check cache first
But access to page tables has good locality, so use a fast cache of PTEs within the CPU, called a Translation Look-aside Buffer (TLB).
Typical: 16–512 PTEs, 0.5–1 cycle for a hit, 10–100 cycles for a miss, 0.01%–1% miss rate.
Misses can be handled by hardware or software.
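A corresponding C sketch that puts a small fully associative TLB in front of the page table from the earlier sketch (sizes, names, and the round-robin refill are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT  12                 /* 4 KB pages (assumed) */
#define NUM_VPAGES  1024               /* toy virtual address space */
#define TLB_ENTRIES 16                 /* small, fully associative (assumed) */

struct pte       { bool valid; uint32_t ppn; };
struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };

static struct pte       page_table[NUM_VPAGES];    /* the full table, kept in "memory" */
static struct tlb_entry tlb[TLB_ENTRIES];          /* cache of recently used mappings */
static unsigned         tlb_next;                  /* trivial round-robin refill pointer */

/* A TLB hit avoids the page table access entirely; a TLB miss walks the
   page table and caches the mapping; returns false on a page fault. */
static bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> PAGE_SHIFT;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);

    for (unsigned i = 0; i < TLB_ENTRIES; i++)     /* hardware compares all entries at once */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_SHIFT) | offset;
            return true;                           /* TLB hit */
        }

    if (vpn >= NUM_VPAGES || !page_table[vpn].valid)
        return false;                              /* TLB miss and page fault */

    tlb[tlb_next] = (struct tlb_entry){ true, vpn, page_table[vpn].ppn };
    tlb_next = (tlb_next + 1) % TLB_ENTRIES;       /* refill the TLB from the PTE */

    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return true;                                   /* TLB miss, page present */
}

int main(void) {
    page_table[3] = (struct pte){ true, 7 };       /* pretend VPN 3 is resident in PPN 7 */

    uint32_t pa, va = (3u << PAGE_SHIFT) | 0x2A4;
    printf("first access:  %s\n", translate(va, &pa) ? "translated (TLB was cold)" : "page fault");
    printf("second access: %s\n", translate(va, &pa) ? "translated (TLB hit)"      : "page fault");
    return 0;
}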
Making Address Translation Fast
[Figure: the TLB holds a few valid (tag, physical page base address) pairs for recently used virtual pages; the page table register points to the full page table in physical memory, whose valid entries hold physical page base addresses and whose invalid entries refer to disk storage.]
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 82
Fast Translation Using a TLB
Translation Lookaside Buffers (TLBs)
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
A TLB entry holds: V, Tag, Physical Page #, Dirty, Ref, and Access bits.
TLB access time is typically smaller than cache access time (because TLBs are much smaller than caches); TLBs are typically not more than 512 entries even on high-end machines.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 84
TLB Misses
If the page is in memory: load the PTE from the page table into the TLB and retry.
This can be handled in hardware (which can get complex for more complicated page table structures) or in software (raise a special exception, with an optimized handler).
If the page is not in memory (page fault): the OS handles fetching the page and updating the page table, and then the faulting instruction is restarted.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 85
TLB Miss Handler
A TLB miss indicates one of two possibilities:
Page present, but PTE not in TLB: the handler copies the PTE from memory to the TLB, then restarts the instruction.
Page not present in main memory: a page fault will occur.
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 86
Page Fault Handler
1. Use the virtual address to find the PTE
2. Locate the page on disk
3. Choose a page to replace, if necessary; if the chosen page is dirty, write it to disk first
4. Read the page into memory and update the page table
5. Make the process runnable again and restart from the faulting instruction
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 87
TLB and Cache Interaction
If the cache tag uses the physical address, we need to translate before the cache lookup.
Alternative: use virtual address tags. This brings complications due to aliasing: different virtual addresses for the same shared physical address.
TLB Event Combinations
TLB   Page Table  Cache     Possible? Under what circumstances?
Hit   Hit         Hit       Yes – what we want!
Hit   Hit         Miss      Yes – although the page table is not checked if the TLB hits
Miss  Hit         Hit       Yes – TLB miss, PA in page table
Miss  Hit         Miss      Yes – TLB miss, PA in page table, but data not in cache
Miss  Miss        Miss      Yes – page fault
Hit   Miss        Miss/Hit  Impossible – TLB translation not possible if the page is not present in memory
Miss  Miss        Hit       Impossible – data not allowed in the cache if the page is not in memory
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 89
Memory Protection
Different tasks can share parts of their virtual address spaces, but we need to protect against errant access; this requires OS assistance.
Hardware support for OS protection: a privileged supervisor mode (aka kernel mode), privileged instructions, page tables and other state information accessible only in supervisor mode, and a system call exception (e.g., syscall in MIPS).
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 90
Concluding Remarks
§5.16 Concluding Remarks
Fast memories are small; large memories are slow. We really want fast, large memories, and caching gives this illusion.
Principle of locality: programs use a small part of their memory space frequently.
Memory hierarchy: L1 cache, L2 cache, ..., DRAM memory, disk.
Memory system design is critical for multiprocessors.